Abstract

This paper considers inference for conditional moment inequality models using a 
multiscale statistic. We derive the asymptotic distribution of this test statistic and use 
the result to propose feasible critical values that have a simple analytic formula. We 
also propose critical values based on a modified bootstrap procedure and prove their 
asymptotic validity. The asymptotic distribution is extreme value, and the proof uses 
new techniques to overcome several technical obstacles. We provide power results that 
show that our test detects local alternatives that approach the identified set at the best 
possible rate under a set of conditions that hold generically in the set identified case in 
a broad class of models, and that our test is adaptive to the smoothness properties of 
the data generating process. Our results also have implications for the use of moment 
selection procedures in this setting. We provide a Monte Carlo study and an empirical
illustration of inference in a regression model with endogenously censored and missing
data.

1 Introduction



This paper shows how to achieve optimal power adaptively in conditional moment inequality
models using a multiscale test statistic. Formally, the model is defined by a vector of
inequality restrictions of the form $E(m(W_i, \theta)|X_i) \ge 0$ almost surely, where $m$ is a known
parametric function and the inequality is taken elementwise. The object of interest is the
identified set $\Theta_0$ of parameter values that satisfy this set of restrictions, and the goal is
to form a test that has good power properties at alternative values of $\theta$ near the boundary
of the identified set. By testing the null $\theta \in \Theta_0$ for each $\theta$, and inverting these tests,
one obtains a confidence region that, for each point in the identified set, contains this
point with a prespecified probability (see Imbens and Manski, 2004, for a discussion of this
and other notions of inference in this setting). This class of models includes numerous
models used in empirical economics, including selection models, regression models with
endogenously missing or censored data, and certain models of firm and consumer behavior
(see below for references from the literature).

We derive the asymptotic distribution of our test statistic and show how it can be used 
to obtain feasible critical values. We provide two methods for computing critical values for 
our test statistic and prove the validity of both using our asymptotic distribution result. 
The first is based on the asymptotic distribution itself and has the advantage of having a 
simple analytic formula that can be computed without using simulation. This is particularly 
useful in applied settings where computational issues can severely limit the applicability of 
tests that require resampling or simulation to compute critical values. The second method 
for computing critical values uses a modified bootstrap procedure. While we focus on least 
favorable critical values, both methods can be used with first stage moment selection proce- 
dures. 

We provide power results that show that our test detects alternative parameter values 
that approach the boundary of the identified set at the fastest possible rate. The test is 
adaptive in the sense that it achieves these rates for data generating processes with a range 
of smoothness properties without prior knowledge of these smoothness properties. The test 
achieves these optimal rates adaptively even without the use of first stage moment selection 
procedures, and our results show that moment selection procedures have little or no first 
order effect on power in many settings. While moment selection procedures will have some 
effect in finite samples, the results suggest that our test is less sensitive to moment selection 
than many of the procedures available in the literature. This is a particularly positive result 
for researchers who prefer not to use pre-tests because of the introduction of arbitrary user 
driven parameters or because of concerns about robustness. The test achieves rate optimal
power adaptively without the need for such pre-tests, and the researcher need not worry
when using this form of our procedure that performing such a pre-test would have had a 
dramatic effect on power. 

The test statistic we consider presents several technical obstacles in deriving the asymp- 
totic distribution. Because of the variance weighting, which is needed for our test to have 
good power properties, the test statistic takes a supremum over a sequence of random pro- 
cesses for which functional central limit theorems do not hold. While similar technical issues 
have been solved in other settings using approximations by sequences of gaussian processes
(see, for example, Bickel and Rosenblatt, 1973; Chernozhukov, Lee, and Rosen, 2009), the
multiscale nature of our test statistic (as opposed to test statistics based on kernels with
a fixed sequence of bandwidths), makes the rate of approximation too poor for our pur- 
poses. In addition, the test statistic we consider takes the supremum over a process that 
is nonstationary in ways that the previous literature has not dealt with, so even deriving 
the asymptotic distribution of the supremum of the approximating gaussian process would 
require new techniques. 

To overcome this, we use methods for tail approximations to nonstationary, nongaussian
processes, applying them directly to the process in the sample. We use methods from
Chan and Lai (2006) to derive tail approximations directly using a combination of moderate
deviations results and tail equicontinuity conditions, thereby circumventing the need for
strong approximations. We verify these conditions for our test statistic directly, and use 
these results in the derivation of the extreme value distribution. While verifying these 
conditions can be challenging, we anticipate that the techniques introduced here will be 
useful in other problems in econometrics where intermediate strong approximations are not 
available or do not give the best results. 



1.1 Related Literature 



This paper is related to the literature on partial identification and, in particular, the
literature on conditional moment inequalities. The tests proposed in this paper are most closely
related to those studied by Armstrong (2011b) and Chetverikov (2012) (the results in the
present paper were developed independently and around the same time as the latter paper).
Armstrong (2011b) considers estimation of the identified set using conservative confidence
regions. While those results could be used for the problem considered here, the methods of
proof used in that paper lead to extremely conservative critical values that are too large to
be useful in most practical settings. Chetverikov (2012) uses a different form of a statistic
similar to ours (the supremum is taken only over a finite set of bandwidths and points that
cannot grow too quickly) and different methods of proof that avoid deriving an asymptotic
distribution or even showing that one exists. From a practical perspective, our method
delivers an analytic formula that can be used to compute a critical value that does not require
simulation, and also proves the asymptotic validity of modified bootstrap procedures, while
the approach taken in Chetverikov (2012) only allows for the latter result.

Papers proposing other approaches to inference on conditional moment inequalities include
Andrews and Shi (2009), Kim (2008), Khan and Tamer (2009), Chernozhukov, Lee, and Rosen
(2009), Lee, Song, and Whang (2011), Ponomareva (2010), Menzel (2008) and Armstrong
(2011a). While these approaches are useful in many settings (for example, settings where
point identification is likely, or where the researcher has prior knowledge of certain
smoothness properties of the data generating process), they do not achieve optimal power adaptively
in the generic set identified case considered here (see Armstrong for local power results
for some of these models). Indeed, the test statistic considered here can be thought of
as introducing an optimal weighting to the statistics proposed by Andrews and Shi (2009)
and Kim (2008), thereby allowing the tests to adaptively achieve optimal power in the set
identified case, but leading to dramatically different behavior of the test statistics (and
leading to the technical difficulties described above for deriving asymptotic distribution results).
The tests considered in this paper can also be thought of as modifying the kernel based
statistics of Chernozhukov, Lee, and Rosen (2009) and Ponomareva (2010) to a multiscale
statistic that chooses the bandwidth automatically and adaptively. As discussed above, this
also leads to difficult technical issues not encountered in the previous literature (the gaussian 
approximations used by those papers do not give good enough rates of approximation, and 
there is additional nonstationarity in the process since it is indexed by the bandwidth as well 
as the location), which the present paper uses new techniques to circumvent. In sum, none of 
the other approaches in the literature satisfy the optimality properties of adaptively achiev- 
ing the best possible rate for detecting local alternatives in set identified models. This paper 
considers a test statistic that satisfies these optimality properties, and, because it differs in 
important ways from other statistics considered in the literature, requires new techniques to 
derive critical values and asymptotic distribution results. 

This paper is also related to the broader literature on partial identification, including
the problem of inference on finitely many unconditional moment inequalities. Articles
that consider this problem include Andrews, Berry, and Jia (2004), Andrews and Jia (2008),
Andrews and Guggenberger (2009), Andrews and Soares (2010), Chernozhukov, Hong, and Tamer
(2007), Romano and Shaikh (2010), Romano and Shaikh (2008), Bugni (2010), Beresteanu and
Molinari (2008), Moon and Schorfheide (2009), Imbens and Manski (2004) and Stoye (2009). In
addition, there have been a number of applications of partial identification, including the
conditional moment inequality models considered here, going back at least to Manski (1990).
There are too many references to name all of them here, but papers include Pakes, Porter,
Ho, and Ishii (2006), Manski and Tamer (2002), and Ciliberto and Tamer (2009).



From a technical standpoint, this paper is related to other papers deriving extreme value
results for supremum statistics. The literature goes back at least to Bickel and Rosenblatt
(1973), and includes recent papers such as Chernozhukov, Lee, and Rosen (2009). The
arguments used in the proof in this paper are substantially different, as they do not use
intermediate approximations by gaussian processes. As discussed in more detail in Section
2, the multiscale nature of the test statistic considered here makes the rates in these
approximations too poor for our purposes. Our result also differs in that the test statistic we
consider takes a supremum over a process that is nonstationary in ways not considered in
the previous literature. While extreme value results have been derived for nonstationary
processes (see, for example, Lee, Linton, and Whang, 2009), these results use other aspects
of the structure of these problems that do not apply in our case.

The test statistic considered in this paper is related to scan statistics considered in the
statistics literature. This paper is also related to the literature on adaptive inference. In
particular, Dümbgen and Spokoiny (2001) apply a similar approach to ours in a one dimensional
gaussian setting. This paper contributes to these literatures by deriving extreme value
approximations in a setting with a multidimensional, nongaussian, nonstationary process,
which requires new techniques for the same reasons described above. Horowitz and Spokoiny
(2001) propose a different test for a related goodness of fit testing problem. Those authors
consider adaptivity with respect to a different class of alternatives than the one in this paper,
leading to a different approach. In particular, Horowitz and Spokoiny (2001) consider minimax
rates with respect to $L_2$ distance in a two-sided testing problem. In the set identified
parametric conditional moment inequality models considered in the present paper, euclidean
distance on the parameter space translates to distance for the conditional mean (see
Armstrong, 2011b, and the power results in Section 4 of this paper), leading to a different
approach (while we use a supremum statistic, Horowitz and Spokoiny use a supremum
over integration based statistics).

1.2 Notation and Plan for Paper

We use the following notation throughout the rest of the paper. For observations $\{Z_i\}_{i=1}^n$,
the sample mean of a function $g$ is given by $E_n g(Z_i) = \frac{1}{n}\sum_{i=1}^n g(Z_i)$. Inequalities are defined
for vectors as holding elementwise. For vectors $a$ and $b$, $a \wedge b$ is the elementwise minimum,
and $a \vee b$ is the elementwise maximum.

The rest of the paper is organized as follows. Section 2 describes the setup and gives the
main asymptotic distribution result. Section 3 derives critical values for the test based on
this result. Section 4 provides results on the power of the test. Section 5 reports the results
of a Monte Carlo study. Section 6 reports the results of an illustrative empirical application.
Section 7 concludes. Appendices to the main text contain proofs of the results in the main
text, as well as some additional results mentioned in the main text, including versions of 
some of the results from the body of the paper that incorporate uniformity in the underlying 
distribution. 

2 Setup and Asymptotic Distribution 

We observe iid data $\{X_i, W_i\}_{i=1}^n$ where $X_i \in \mathbb{R}^{d_X}$ and $W_i \in \mathbb{R}^{d_W}$. We wish to test the null
hypothesis

$$E(m(W_i,\theta)|X_i) \ge 0 \quad \text{a.s.} \qquad (1)$$

where $m : \mathbb{R}^{d_W} \times \Theta \to \mathbb{R}^{d_Y}$ is a known measurable function and $\theta \in \Theta \subseteq \mathbb{R}^{d_\theta}$ is a fixed
parameter value. We use the notation $\bar m(\theta,x)$ to denote a version of $E(m(W_i,\theta)|X_i = x)$.
Typically, the null (1) is tested for each value of $\theta$ in order to obtain a confidence region for
parameters that are consistent with the model. The model may not be point identified, in
the sense that there may be more than one value of $\theta$ consistent with (1), and the tests in this
paper are specifically geared towards this case. In general, we denote by $\Theta_0$ the identified
set of parameter values that are consistent with the restrictions in (1):

$$\Theta_0 = \{\theta \in \Theta \mid E(m(W_i,\theta)|X_i) \ge 0 \text{ a.s.}\}.$$

While the above setup considers only a single probability distribution, this is only for
notational convenience. We show in Appendix A that our test controls the asymptotic size
uniformly over appropriate classes of underlying distributions.

We note that, while the above setup is written in terms of a parametric model $m(W_i,\theta)$,
our methods apply more generally to test the inequality $E(Y_i|X_i) \ge 0$ a.s., where $Y_i$ is any
random variable satisfying certain regularity conditions below. The reason we impose this
additional structure is that our tests are designed to have good power properties for values
of $\theta$ that violate the null, but are near the identified set $\Theta_0$ of parameters that satisfy the
null. Since our goal is to distinguish parameter values in $\Theta_0$ from nearby parameter values
outside of $\Theta_0$, we state our power results in terms of sequences of parameter values and the
rate at which they approach the boundary of $\Theta_0$ (see Section 4). By deriving our results in
terms of alternative parameter values rather than mathematical notions of distances of data
generating processes, we obtain power results that are immediately applicable to assessing
the statistical accuracy of confidence regions based on our tests in economic models.

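Consider the test statistic $T_n = (T_{n,1}, \ldots, T_{n,d_Y})$ where

$$T_{n,j} = T_{n,j}(\theta) = \left| \inf_{I(s,t) \subseteq \hat{\mathcal{X}},\, t \ge t_n} \frac{E_n m_j(W_i,\theta) I(s < X_i < s+t)}{\hat\sigma_{n,j}(s,t,\theta)} \right|_{-},$$

$|x|_{-} = |\min\{x, 0\}|$ denotes the magnitude of the negative part, $t_n$ is a sequence going to
zero, $\hat{\mathcal{X}}$ is the convex hull of $\{X_i\}_{i=1}^n$,
$I(s,t) = [s_1, s_1 + t_1) \times \cdots \times [s_{d_X}, s_{d_X} + t_{d_X})$ and

$$\hat\sigma_{n,j}(s,t,\theta)^2 = E_n m_j(W_i,\theta)^2 I(s < X_i < s+t) - \left[E_n m_j(W_i,\theta) I(s < X_i < s+t)\right]^2.$$

We can form a test by rejecting for large values of $S_n = S(T_n)$, where $S : \mathbb{R}^{d_Y} \to \mathbb{R}$ is some
function that is nondecreasing in each argument. For concreteness, we take $S$ to be the function
that takes the maximum of the components of $T_n$:

$$S_n = S_n(\theta) = \max_{1 \le j \le d_Y} T_{n,j}(\theta).$$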
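To make the definition concrete, the following is a minimal sketch of a brute-force computation of one component of the statistic when $X_i$ is scalar. It is an illustration under stated assumptions, not the implementation used in the paper: the function name `multiscale_statistic`, the restriction to $d_X = 1$, the use of observed data points as window endpoints, and the convention of reporting the magnitude of the most negative studentized window mean are all choices made for this example.

```python
import numpy as np

def multiscale_statistic(x, m, t_min):
    """Brute-force sketch of T_{n,j} for scalar X_i and one moment function.

    x     : (n,) conditioning variable
    m     : (n,) values m_j(W_i, theta) at the tested theta
    t_min : smallest window width t_n
    Windows use observed points as endpoints (an assumed discretization
    of the infimum, not an exact search over all s and t).
    """
    order = np.argsort(x)
    x, m = x[order], m[order]
    n = len(x)
    # Prefix sums give window sums of m and m^2 in O(1) per window.
    c1 = np.concatenate(([0.0], np.cumsum(m)))
    c2 = np.concatenate(([0.0], np.cumsum(m ** 2)))
    worst = 0.0  # most negative studentized window mean found so far
    for lo in range(n):
        for hi in range(lo + 1, n + 1):
            # The window covers sorted observations lo, ..., hi - 1.
            if x[hi - 1] - x[lo] < t_min:
                continue
            mean = (c1[hi] - c1[lo]) / n             # E_n m I(s < X < s + t)
            var = (c2[hi] - c2[lo]) / n - mean ** 2  # sigma-hat squared
            if var > 0:
                worst = min(worst, mean / np.sqrt(var))
    return -worst  # magnitude of the negative part
```

The double loop costs $O(n^2)$ evaluations, which is adequate for illustration; with $d_Y > 1$ one would apply the function to each component and take the maximum.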



It is worth commenting on the properties of this test statistic that differ from other
statistics for this problem, and how they lead to optimal power properties for set identified
models. We discuss this briefly here, and refer the reader to Armstrong (2012), which
contains a comprehensive treatment of the power properties of different approaches, for
details. In testing $E(m(W_i,\theta)|X_i) \ge 0$ a.s., one can use essentially any test statistic that
estimates $E(m(W_i,\theta)|X_i)$ and takes some function of this that is large in magnitude when
this estimate is negative for some value of $x$. Most conditional mean estimates can be thought
of as using an instrumental variables approach, where the inequality $E(m(W_i,\theta)|X_i) \ge 0$ a.s.
is transformed into a set of inequalities $E m(W_i,\theta) g(X_i) \ge 0$ for all $g$, where $g$ ranges over a
set $\mathcal{G}_n$ that is infinite or increases with the sample size (e.g., a kernel estimator does this with the
functions $g$ given by $k((X_i - x)/h_n)$ where $h_n$ goes to zero at some rate and $x$ ranges over the
support of $X_i$) and the inequality may only hold approximately if $g$ is not positive everywhere
(e.g. if higher order kernels or sieves are used). Once a class $\mathcal{G}_n$ is decided on, one faces the
decision of how to transform estimates of $E m(W_i,\theta) g(X_i)$ into a statistic that is positive
and large in magnitude whenever one of these estimates is negative and large in magnitude.
This includes deciding on how to weight each function $g$, and how to combine them. For the
latter problem, one can take some power of the negative part of the test statistic and add or
integrate these over $g$ (a Cramér-von Mises or CvM style approach), or take the maximum
or supremum of the negative part (a Kolmogorov-Smirnov or KS approach). In addition,
since the null space is composite, one faces a choice in how to pick the critical value, and, in
particular, whether to choose a critical value based on the least favorable distribution in the
null space where $E(m(W_i,\theta)|X_i = x) = 0$ for all $x$, or whether to use a pre-testing procedure
that determines where the equality may hold and uses smaller critical values based on the
results of this procedure.

In sum, one faces the decision of (1) which instruments (or kernels or sieves, etc.) to 
use, (2) how to weight them, (3) how to combine them (integration or summing, or taking 
the supremum) and (4) how to choose the critical value. For (1), our test statistic uses 
a class of product kernels with all possible bandwidths. Using a class of functions with 
multiple scales, rather than a kernel function with a single bandwidth, allows the test to find 
the optimal bandwidth adaptively for a range of smoothness conditions. For (2), the test 
statistic $S_n$ weights each function by its variance. This weighting is essential in allowing the
test statistic to find the instrument function that balances bias and variance in an optimal 
way for detecting a given alternative, and the improvement in power in the set identified 
case can be thought of as an optimal weighting result for moment inequality models. 

For (3) our test statistic uses a supremum (KS) criterion rather than a criterion based on 
sums or integrals (a CvM criterion). To understand why a KS approach leads to more power 
than a CvM approach, it is helpful to consider the relationship between the nonsimilarity 
of these tests on the boundary of the identified set and power at nearby alternatives. If 
a test statistic behaves differently depending on where $\bar m(\theta, x) = 0$, then using the most
conservative critical value will lead to poor power in cases where nearby parameter values 
in the null space lead to the inequality binding on a small set. While moment selection 
procedures can help alleviate this, the procedures that are known to be robust in these 
settings are not aggressive enough to attain good power. KS statistics are less sensitive
to which moments bind since the supremum of $k$ sample means increases at a $\sqrt{\log k}$ rate,
while the sum of the positive part increases at a polynomial rate in k. Thus, by using 
a KS criterion, our test statistic achieves good power without requiring moment selection 
procedures, and the power of the test is less sensitive to these procedures, so that the decision 
(4) has less impact on the power of the test. 
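The contrast between the two growth rates can be checked numerically. The sketch below is only an illustration: it uses independent standard normal draws as a stand-in for $k$ studentized sample means (an assumption made for this example), and the function name `ks_and_cvm` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def ks_and_cvm(k, reps=2000):
    """Average maximum and average sum of positive parts of k N(0,1) draws."""
    z = rng.standard_normal((reps, k))
    ks = z.max(axis=1).mean()                        # KS-style criterion
    cvm = np.clip(z, 0.0, None).sum(axis=1).mean()   # CvM-style criterion
    return ks, cvm

for k in (10, 100, 1000):
    ks, cvm = ks_and_cvm(k)
    # The last column is the sqrt(2 log k) benchmark for the maximum.
    print(k, round(ks, 2), round(cvm, 1), round(np.sqrt(2 * np.log(k)), 2))
```

In runs of this kind the maximum grows slowly with $k$ while the sum grows roughly in proportion to $k$, which is the mechanism behind the lower sensitivity of the KS criterion to the number of binding moments.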
We impose the following conditions.

Assumption 2.1.

a.) The distribution of $m(W_i,\theta)$ conditional on $X_i$ satisfies the following conditions.

i.) There exists a $\lambda > 0$ and a constant $M_\lambda$ such that

$E(\exp(\lambda |m_j(W_i,\theta)|)|X_i) \le M_\lambda$ a.s. for all $1 \le j \le d_Y$.

ii.) $var(m_j(W_i,\theta)|X_i = x)$ is positive and continuous in $x$ for all $j$.

iii.) $corr(m_j(W_i,\theta), m_k(W_i,\theta)|X_i = x)$ is bounded away from 1 for all $j \ne k$.

b.) The support $\mathcal{X}$ of $X_i$ is a compact, convex Jordan measurable set with strictly positive
measure, and $X_i$ has a density $f$ that is bounded away from zero on $\mathcal{X}$.

c.) $t_n \to 0$ and $n t_n^{d_X} / |\log t_n|^4 \to \infty$.

Part (a) imposes regularity conditions on the moments of $m(W_i,\theta)$. It is worth noting
that, while we impose some mild smoothness assumptions on the conditional variance, we
place no assumptions on the smoothness of the conditional mean. Thus, while the power of
our test depends on the smoothness properties of the conditional mean, our test is robust
to very nonsmooth data generating processes. The convexity assumption in part (b) is
imposed to simplify certain parts of the proof, and could be relaxed. The condition on $t_n$ in
part (c) is, up to the $|\log t_n|$ term, the best possible rate. As discussed further below,
other methods of deriving critical values for this test statistic would not allow $t_n$ to decrease
quickly enough for the statistic to have good power.

The following theorem gives the asymptotic distribution of this test statistic, and provides
feasible critical values that can be calculated analytically. For a version of this theorem that
incorporates uniformity in the underlying distribution, we refer the reader to Appendix A.

Theorem 2.1. Suppose that the null hypothesis (1) and Assumption 2.1 hold for $\theta$. Let
$c_n = vol(\mathcal{X})/t_n^{d_X}$ and let $a(c_n) = (2 n \log c_n)^{1/2}$ and
$b(c_n) = 2\log c_n + (2 d_X - 1/2)\log\log c_n - \log(2\sqrt{\pi})$. Then, for any vector $r \in \mathbb{R}^{d_Y}$,

$$\liminf_n P\left(a(c_n) T_n - b(c_n) \le r\right) \ge P(Z \le r)$$

where $Z$ is a $d_Y$ dimensional vector of independent standard type I extreme value random
variables. If, in addition, $\bar m_j(\theta,x) = 0$ for all $x$ and $j$, then

$$a(c_n) T_n - b(c_n) \xrightarrow{d} Z.$$



3 Inference 

An immediate consequence of Theorem 2.1 is a method for choosing feasible critical values for
the test statistic $S_n(\theta)$ that can be computed analytically. By Theorem 2.1, $a(c_n) S_n - b(c_n)$
is asymptotically bounded by a random variable that is the maximum of $d_Y$ standard type
I extreme value random variables. By the properties of extreme value random variables,
this distribution is itself type I extreme value, with cdf $\exp(-d_Y \exp(-r))$. Some calculation
leads to the rejection rule

$$\text{reject if } S_n(\theta) > q_{1-\alpha} \text{ where } q_{1-\alpha} = \frac{\log(d_Y) - \log(-\log(1-\alpha)) + b(c_n)}{a(c_n)}. \qquad (2)$$

It follows from Theorem 2.1 that this test is asymptotically level $\alpha$. We record this result in
the following theorem.

Theorem 3.1. Suppose that the null hypothesis (1) holds for $\theta$ and that Assumption 2.1
holds. Let $q_{1-\alpha}$ be as defined in (2). Then

$$\limsup_n P\left(S_n(\theta) > q_{1-\alpha}\right) \le \alpha.$$

If, in addition, $\bar m(\theta, x) = 0$ for all $x \in \mathcal{X}$, then

$$P\left(S_n(\theta) > q_{1-\alpha}\right) \to \alpha.$$
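Because the analytic critical value is a closed-form expression, it can be evaluated in a few lines. The following sketch simply transcribes the normalizing sequences $a(c_n)$ and $b(c_n)$ and the rejection threshold; the function name, the argument names, and the default `vol_x=1.0` (as for a unit-volume support) are assumptions made for this example.

```python
import math

def analytic_critical_value(n, t_n, d_x, d_y, vol_x=1.0, alpha=0.05):
    """Sketch of the analytic critical value q_{1-alpha}.

    n     : sample size
    t_n   : minimal window width
    d_x   : dimension of X_i
    d_y   : number of moment inequalities
    vol_x : volume of the support of X_i (assumed known here)
    Requires c_n = vol_x / t_n**d_x > 1 so that log log c_n is defined.
    """
    log_c = math.log(vol_x / t_n ** d_x)
    a = math.sqrt(2.0 * n * log_c)
    b = (2.0 * log_c + (2.0 * d_x - 0.5) * math.log(log_c)
         - math.log(2.0 * math.sqrt(math.pi)))
    # Solve exp(-d_y * exp(-r)) = 1 - alpha for r.
    r = math.log(d_y) - math.log(-math.log(1.0 - alpha))
    return (r + b) / a

print(analytic_critical_value(n=1000, t_n=0.05, d_x=1, d_y=2))
```

The step that solves for $r$ is the "some calculation" in the text: setting the limiting cdf equal to $1 - \alpha$ and inverting.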



3.1 Simulated Critical Values 

While the critical value given in (2) gives a valid asymptotically level $\alpha$ test, this critical
value is based on extreme value approximations that may perform poorly in finite samples in
certain situations. While our Monte Carlos suggest that the analytic critical values perform
well in many cases encountered in practice, we also propose bootstrap or simulation based
critical values to remedy these issues in cases where the extreme value approximations do
poorly. The asymptotic validity of these critical values follows from the same results as the
asymptotic distribution result in Theorem 2.1.

We define our simulated critical values as follows. Let $M_n(x)$ be any random
sequence of functions that map $\mathcal{X}$ to $d_Y \times d_Y$ symmetric, positive definite matrices.
We require that the sequence of variance matrices given by $M_n(x)$ be continuous in $x$ and have
correlation coefficients bounded away from one uniformly over $n$ with probability one. One
can choose $M_n(x)$ to be an estimate of the conditional variance matrix of $m(W_i,\theta)$,
but this is not necessary, and $M_n(x)$ can even be chosen to be the constant function that
takes all values to the identity matrix. For each repetition $b$ of $B$ simulations, we draw $n$
independent outcome variables $\{Y_i^{*,b}\}_{i=1}^n$ with $Y_i^{*,b} \sim N(0, M_n(X_i))$ independent across $i$
and $b$ conditional on the data. We form the test statistic $S_n^{*,b}$ for this repetition by replacing
$m(W_i,\theta)$ with $Y_i^{*,b}$ in the definition of the test statistic. The simulated critical value is given
by the $1 - \alpha$ quantile of this bootstrap distribution:

$$q_{1-\alpha,sim} = \inf\left\{ q \;\Big|\; \frac{1}{B}\sum_{b=1}^{B} I\left(S_n^{*,b} \le q\right) \ge 1 - \alpha \right\}. \qquad (3)$$

The asymptotic validity of this test follows immediately from the version of Theorem 2.1 in
Appendix A that incorporates uniformity in the underlying distribution.

Theorem 3.2. Suppose that the null hypothesis (1) holds for $\theta$ and that Assumption 2.1
holds. Let $q_{1-\alpha,sim}$ be as defined in (3). Then

$$\limsup_n P\left(S_n(\theta) > q_{1-\alpha,sim}\right) \le \alpha.$$

If, in addition, $\bar m(\theta, x) = 0$ for all $x \in \mathcal{X}$, then

$$P\left(S_n(\theta) > q_{1-\alpha,sim}\right) \to \alpha.$$
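The simulation procedure can be sketched as follows for the simplest case. This is an illustration under explicit assumptions rather than the authors' code: $X_i$ is scalar, there is a single moment ($d_Y = 1$), $M_n(x)$ is the identity (which the text allows), window endpoints are restricted to observed data points, and the helper names `window_stat` and `simulated_critical_value` are hypothetical.

```python
import numpy as np

def window_stat(x_sorted, y, t_min):
    """Brute-force one-dimensional statistic on pre-sorted x (sketch)."""
    n = len(y)
    c1 = np.concatenate(([0.0], np.cumsum(y)))
    c2 = np.concatenate(([0.0], np.cumsum(y ** 2)))
    lo, hi = np.triu_indices(n + 1, k=1)   # all index pairs with lo < hi
    keep = x_sorted[hi - 1] - x_sorted[lo] >= t_min
    lo, hi = lo[keep], hi[keep]
    mean = (c1[hi] - c1[lo]) / n
    var = (c2[hi] - c2[lo]) / n - mean ** 2
    ok = var > 1e-12
    ratio = mean[ok] / np.sqrt(var[ok])
    return max(0.0, -ratio.min()) if ratio.size else 0.0

def simulated_critical_value(x, t_min, alpha=0.05, n_sim=200, seed=0):
    """1 - alpha quantile of the statistic under N(0, 1) outcomes."""
    rng = np.random.default_rng(seed)
    x_sorted = np.sort(x)
    # The simulated outcomes are iid and independent of x, so drawing
    # them directly in sorted-x order is equivalent.
    draws = np.array([
        window_stat(x_sorted, rng.standard_normal(len(x)), t_min)
        for _ in range(n_sim)
    ])
    # Smallest q with (1/B) * #{S^{*,b} <= q} >= 1 - alpha; the small
    # offset guards against floating-point round-up inside ceil.
    k = int(np.ceil((1.0 - alpha) * n_sim - 1e-9))
    return np.sort(draws)[k - 1]

x = np.random.default_rng(1).uniform(size=60)
print(simulated_critical_value(x, t_min=0.1))
```

Taking the order statistic with index $\lceil (1-\alpha) B \rceil$ implements the infimum in the definition of the simulated critical value exactly for a finite number of draws.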



3.2 Moment Selection Procedures and the Choice of t n 

The rejection probabilities of the tests defined above will converge to $\alpha$ when the conditional
mean $\bar m_j(\theta,x)$ is equal to zero for all $x \in \mathcal{X}$ for all $j$. If these inequalities only bind on a
subset of $\mathcal{X}$, the rejection probability will be strictly less than $\alpha$, and it would seem that
there would be the potential for large power improvements at nearby alternatives by using a
smaller critical value that takes this into account. Perhaps surprisingly, it turns out that there
will typically be no first order power improvement from doing this in our setting. While this
result should certainly not be taken to mean that there will be no effect on power in finite
samples, the result suggests that our procedure will be less sensitive to moment selection
than other procedures in the literature for which moment selection has a large effect on
power asymptotically (see, for example, Armstrong, 2011a; Andrews and Shi, 2009).

To see why this holds, first note that it can be shown that, if $\bar m_j(\theta,x) = 0$ for some $j$
only for values in some subset $\tilde{\mathcal{X}}$, Theorem 2.1 will hold with $\tilde{\mathcal{X}}$ replacing $\mathcal{X}$. Thus, if we
use prior knowledge of such a set $\tilde{\mathcal{X}}$ with strictly positive volume, or find such a set with a
first stage test, we would obtain a critical value $q_{1-\alpha}$ with $\mathcal{X}$ replaced by $\tilde{\mathcal{X}}$. But note that,
regardless of $\tilde{\mathcal{X}}$, the critical value $q_{1-\alpha}$ satisfies

$$q_{1-\alpha} \sim b(c_n)/a(c_n) \sim (2\log c_n)^{1/2}/n^{1/2} \sim \left[2\log t_n^{-d_X} + 2\log vol(\tilde{\mathcal{X}})\right]^{1/2}/n^{1/2} \sim \left(2\log t_n^{-d_X}\right)^{1/2}/n^{1/2}$$

(and the same holds for the simulated critical value $q_{1-\alpha,sim}$). Thus, even with prior
knowledge of the contact set, the contact set would have only a second order effect on the critical
value.
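As a hedged worked illustration of this second order effect, suppose the contact set has volume equal to a fraction $v \in (0,1]$ of the volume of the support, so that the relevant normalizing constant is $v c_n$ instead of $c_n$. The symbols $v$ and the numerical values below are assumptions chosen for the example; the comparison uses only the leading term of the critical value.

```latex
\[
\frac{\left(2\log (v c_n)\right)^{1/2}/n^{1/2}}{\left(2\log c_n\right)^{1/2}/n^{1/2}}
  = \left(1 + \frac{\log v}{\log c_n}\right)^{1/2} \longrightarrow 1
  \quad \text{as } c_n \to \infty .
\]
% Example with v = 0.1:
%   c_n = 10^4:  (1 - 1/4)^{1/2} \approx 0.866
%   c_n = 10^8:  (1 - 1/8)^{1/2} \approx 0.935
```

The ratio approaches one only at a logarithmic rate, which is in line with the caveat that moment selection can still matter in finite samples even though it has no first order effect.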

The above calculations can also be used to understand the effect of the choice of the
minimal window width $t_n$ on the power of the test. Suppose that $t_n$ is chosen proportional
to $n^{-\delta}$ for some $0 < \delta < 1$. Then, by the above calculations, we will have

$$q_{1-\alpha} \sim (2 d_X \delta \log n)^{1/2}/n^{1/2}.$$

As shown in Section 4, larger values of $\delta$ are required to obtain optimal power properties
for less smooth conditional means. While choosing a larger value of $\delta$ does not affect the
rate at which local alternatives can approach the null space and be detected (the test is
adaptive with $t_n$ decreasing as quickly as allowed), it does have a non negligible effect on
power through larger critical values. If $t_n$ is chosen as $n^{-\delta_1}$ for some value $\delta_1$ instead of some
other value $\delta_2$ where $\delta_1 > \delta_2$, the critical value will increase by a factor of $(\delta_1/\delta_2)^{1/2}$.
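For a sense of magnitude, the following worked example evaluates the leading term of the critical value for two window exponents. The specific numbers ($d_X = 1$, $n = 10^4$, exponents $1/2$ and $1/4$) are assumptions chosen purely for illustration.

```latex
\[
q \approx \frac{(2 d_X \delta \log n)^{1/2}}{n^{1/2}}, \qquad d_X = 1,\; n = 10^4,\; \log n \approx 9.21 .
\]
\[
\delta = \tfrac14:\; q \approx \frac{(4.61)^{1/2}}{100} \approx 0.0215,
\qquad
\delta = \tfrac12:\; q \approx \frac{(9.21)^{1/2}}{100} \approx 0.0303,
\qquad
\frac{0.0303}{0.0215} \approx \left(\frac{1/2}{1/4}\right)^{1/2} = \sqrt{2}.
\]
```

Doubling the exponent therefore raises the leading term of the critical value by about 41 percent in this example, regardless of the sample size.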

Note also that the critical value is, up to first order, the same as the critical value for
a test that only takes the infimum over all $s$ with $t$ fixed at $t_n$, which would correspond to
the kernel approach considered in Chernozhukov, Lee, and Rosen (2009) and Ponomareva
(2010). Thus, in typical settings, there is no first order loss in power from considering larger
bandwidths using the multiscale approach in this paper even if the optimal bandwidth is
known.



4 Local Power 

This section derives asymptotic approximations to power functions by considering the power
of these tests under sequences of alternative parameter values that approach the boundary
of the identified set. Consider a parameter value $\theta_0$ on the boundary of the identified set,
and a sequence of local alternatives given by $\theta_n = \theta_0 + a r_n$ for some vector $a \in \mathbb{R}^{d_\theta}$ and some
sequence of scalars $r_n \to 0$. We impose the following conditions (see Armstrong, 2011b, for
verification in several examples of a set of conditions that imply Assumption 4.1).

Assumption 4.1.

a.) $\bar m(\theta,x)$ is differentiable in $\theta$ with derivative $\bar m_\theta(\theta,x)$ that is continuous as a function
of $\theta$ uniformly in $(\theta,x)$.

b.) For some $\gamma$, $C$, $j$ and $x_0 \in \mathcal{X}$, we have $\bar m_j(\theta_0,x_0) = 0$ and, for all $x$ in a neighborhood
of $x_0$,

$$|\bar m_j(\theta_0, x) - \bar m_j(\theta_0, x_0)| \le C\|x - x_0\|^{\gamma}.$$

Part (b) of Assumption 4.1 is a smoothness condition on the conditional mean under $\theta_0$.
If $\bar m_j(\theta_0,x_0) = 0$ for some $x_0$, part (b) will hold with $\gamma = 1$ if $\bar m_j(\theta_0,x)$ has a continuous
first derivative in $x$, and it will hold with $\gamma = 2$ if $\bar m_j(\theta_0,x)$ has a continuous second
derivative in $x$ and $x_0$ is on the interior of $\mathcal{X}$.

The following theorem gives local power results for sequences of local alternatives. To state the results, let $C(\cdot)$ be any bounded function on the unit sphere such that Assumption 4.1 holds with $C$ replaced by $C((x - x_0)/\|x - x_0\|)$. We can always take this function to be a constant function under Assumption 4.1 but, using this notation, we can state power results that are more precise.

Theorem 4.1. Suppose that Assumption 4.1 holds for $\theta_0$ and that Assumption 2.1 holds with the constants in part (a) uniform over a neighborhood of $\theta_0$. Let $\theta_n = \theta_0 + a r_n$ for some $a \in \mathbb{R}^{d_\theta}$ and a sequence of scalars $r_n \to 0$. Suppose that, for some index $j$ such that part (b) of Assumption 4.1 holds for $j$,

$\liminf_n r_n \left(\frac{n}{2\log t_n^{-d_X}}\right)^{\gamma/(d_X+2\gamma)}$

/(^o) 1/2 Lzu,s<u<s + t{l™eAeo,xo)a] + C (^) HI?} du 



-7/(djf/2+7) 



> 



where U = \J^ =1 {X — Xfyj/r^, Y>jj(x) = var(rrij(Wi, 9)\Xi = x) and the right hand side is 
taken to be infinity if the infimum in the brackets is zero. Then, if $t_n \le \eta (n/\log n)^{-1/(d_X+2\gamma)}$ for small enough $\eta$, we will have

$$P(S_n(\theta_n) > q_{1-\alpha}) \to 1.$$

If $\bar m_{\theta,j}(\theta_0,x_0)a$ is strictly negative, which will typically be the case as long as $\theta_n$ is outside of the identified set, this result shows that the power of the test approaches one as long as $\theta_n$ approaches $\theta_0$ at a $(n/\log n)^{-\gamma/(d_X+2\gamma)}$ rate with a large enough scaling. This corresponds to the fastest rate that would be achievable even if $\gamma$ were known (see, for example, Stone (1982)). Theorem 4.1 shows that our test is adaptive in the sense that it achieves this rate simultaneously for all $\gamma$ without prior knowledge of $\gamma$. Taking $t_n$ to be a $\log n$ term times $n^{-1/d_X}$, the condition that $t_n \le \eta(n/\log n)^{-1/(d_X+2\gamma)}$ will be satisfied regardless of $\gamma$. Another possibility is to take the smallest value of $\gamma$ that the researcher thinks is likely, and to choose a value of $t_n$ that is optimal for a particular data generating process and sequence of alternatives with that value of $\gamma$. Theorem 4.1 shows that this approach will achieve the optimal rate even if $\gamma$ is larger than the value used to choose $t_n$.
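To make the rate statements concrete, the following minimal sketch evaluates the local-alternative rate $(n/\log n)^{-\gamma/(d_X+2\gamma)}$ and the bound $\eta(n/\log n)^{-1/(d_X+2\gamma)}$ on $t_n$ for a few smoothness levels. The function names, the default $\eta = 1$ and the specific choice $t_n = (\log n)\, n^{-1/d_X}$ are illustrative assumptions, not part of the paper.

```python
import math


def detection_rate(n: int, gamma: float, d_x: int) -> float:
    """(n / log n)^(-gamma / (d_x + 2*gamma)): rate at which detectable
    local alternatives may approach theta_0."""
    return (n / math.log(n)) ** (-gamma / (d_x + 2.0 * gamma))


def truncation_upper_bound(n: int, gamma: float, d_x: int, eta: float = 1.0) -> float:
    """eta * (n / log n)^(-1 / (d_x + 2*gamma)): bound on t_n in Theorem 4.1.
    eta = 1 is an arbitrary illustrative constant."""
    return eta * (n / math.log(n)) ** (-1.0 / (d_x + 2.0 * gamma))


def conservative_truncation(n: int, d_x: int) -> float:
    """One possible 'log n term times n^(-1/d_x)' choice of t_n (assumed form)."""
    return math.log(n) * n ** (-1.0 / d_x)


if __name__ == "__main__":
    n, d_x = 10**6, 1
    for gamma in (0.5, 1.0, 2.0):
        ok = conservative_truncation(n, d_x) <= truncation_upper_bound(n, gamma, d_x)
        print(gamma, detection_rate(n, gamma, d_x), ok)
```

With these assumed constants, the conservative choice of $t_n$ stays below the bound for every $\gamma$ tried, which is the sense in which a single small $t_n$ can serve all smoothness levels.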



5 Monte Carlo 

We perform Monte Carlo simulations with several designs based on a median regression model with potentially endogenously missing data. We consider a missing data model where the conditional median of $W_i^*$ given $X_i$ is given by $q_{1/2}(W_i^*|X_i) = \theta_1 + \theta_2 X_i$, and $W_i^*$ is missing for some observations. Letting $W_i^H = W_i^*$ when $W_i^*$ is observed and $\infty$ otherwise, this leads to the conditional moment inequality $E[I(\theta_1 + \theta_2 X_i \le W_i^H) - 1/2 \mid X_i] \ge 0$ a.s. (in practice, one would form another inequality based on a lower bound for $W_i^*$ of $-\infty$ when $W_i^*$ is not observed, but we focus on a single moment inequality in the Monte Carlo simulations for simplicity).

In each design, we simulate the data from a median regression given by $W_i^* = \theta_1^* + \theta_2^* X_i + u_i$ for some $(\theta_1^*, \theta_2^*)$ where $u_i \sim \text{unif}(-1, 1)$ and $X_i \sim \text{unif}(0, 1)$. We then set $W_i^*$ to be missing with probability $p(X_i)$ independently of $W_i^*$ for some function $p(x)$ (note that, while we generate the data using a parameter value that satisfies missingness at random, the test is designed to give confidence regions that are robust to the failure of this assumption). We consider 3 designs with $\theta_1^* = \theta_2^* = 0$ and $p(x)$ given as follows for each design:

Design 1: $p(x) = .1$

Design 2: $p(x) = .02 + 2 \cdot .98 \cdot |x - .5|$

Design 3: $p(x) = .02 + 4 \cdot .98 \cdot (x - .5)^2$.
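The three designs can be simulated along the following lines. This is a minimal sketch of the data generating process described above, using only the Python standard library; the function names and the seeding convention are hypothetical.

```python
import math
import random
from typing import List, Tuple


def missing_probability(design: int, x: float) -> float:
    """p(x) for Designs 1-3 as listed above."""
    if design == 1:
        return 0.1
    if design == 2:
        return 0.02 + 2 * 0.98 * abs(x - 0.5)
    if design == 3:
        return 0.02 + 4 * 0.98 * (x - 0.5) ** 2
    raise ValueError("design must be 1, 2 or 3")


def simulate_design(design: int, n: int, seed: int = 0) -> Tuple[List[float], List[float]]:
    """Draw (X_i, W_i^H) with theta_1* = theta_2* = 0.

    W_i^H equals W_i^* when observed and +infinity when missing.
    """
    rng = random.Random(seed)
    xs, w_high = [], []
    for _ in range(n):
        x = rng.uniform(0.0, 1.0)
        u = rng.uniform(-1.0, 1.0)
        w_star = 0.0 + 0.0 * x + u  # median regression with zero coefficients
        # Missingness depends on x only, independently of w_star.
        is_missing = rng.random() < missing_probability(design, x)
        xs.append(x)
        w_high.append(math.inf if is_missing else w_star)
    return xs, w_high
```

For example, `simulate_design(2, 500)` returns 500 covariate draws together with the upper bounds $W_i^H$ used in the moment inequality.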

Design 1 corresponds to a flat conditional mean, while Designs 2 and 3 correspond to $\gamma = 1$ and $\gamma = 2$ in Assumption 4.1 respectively. For each design, we consider the sample sizes $n = 100, 500, 1000$ and the truncation parameters $t_n = n^{-1/5}, n^{-1/3}, n^{-1/2}$ for each sample size. Note that $n^{-1/3}$ is the optimal rate for $t_n$ for Design 2 and $n^{-1/5}$ is the optimal rate for $t_n$ for Design 3, while $t_n = n^{-1/2}$ is smaller than optimal for all three designs, but still achieves the optimal rate for local alternatives by Theorem 4.1.

For each design, we test several parameter values with $\theta_2$ fixed at 0 and $\theta_1$ varying. For a given design, let $\bar\theta_1$ be the largest value of $\theta_1$ such that $(\bar\theta_1, 0)$ is in the identified set. First, to examine the finite sample size of the test based directly on the asymptotic distribution, we report Monte Carlo estimates of the false rejection probability under $(\bar\theta_1, 0)$ and Design 1, which corresponds to a least favorable null distribution with the conditional moment inequality equal to zero for all $x$. This gives an idea of the worst (most liberal) size distortions one can expect from tests based on critical values calculated directly from the asymptotic distribution (at least, in situations similar to the median regressions with potentially endogenously missing or censored data considered here).
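As a worked check on the boundary value, note that under the simulation design with $\theta_2 = 0$ and $-1 \le \theta_1 \le 1$ we have $P(\theta_1 \le W_i^H \mid X_i = x) = p(x) + (1 - p(x))(1 - \theta_1)/2$. Requiring this to be at least $1/2$ for all $x$ gives $\bar\theta_1 = p_{\min}/(1 - p_{\min})$, where $p_{\min} = \min_x p(x)$. This derivation is ours rather than the paper's, and the helper names below are hypothetical.

```python
def conditional_moment(theta1: float, p: float) -> float:
    """E[I(theta1 <= W^H) - 1/2 | X = x] with theta_2 = 0 and p = p(x).

    Valid for -1 <= theta1 <= 1, using u ~ unif(-1, 1) and W^H = +inf
    with probability p independently of u.
    """
    prob_not_below = p + (1.0 - p) * (1.0 - theta1) / 2.0
    return prob_not_below - 0.5


def theta1_bar(p_min: float) -> float:
    """Largest theta1 with nonnegative conditional moment for every x.

    Assumes p_min <= 1/2 so that the result lies in [-1, 1].
    """
    return p_min / (1.0 - p_min)


if __name__ == "__main__":
    print(theta1_bar(0.1))   # Design 1: p(x) = .1 everywhere
    print(theta1_bar(0.02))  # Designs 2 and 3: minimum of p(x) at x = .5
```

In Design 1 the conditional moment is exactly zero at $\bar\theta_1$ for every $x$, which matches the description of this design as least favorable.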

Table 1 reports these results. We note that size distortions are generally minimal, except for the smaller sample sizes with the largest value of $t_n$, particularly with nominal size $\alpha = .1$. As one might expect from the methods used in the derivation of the asymptotic distribution, which rely on tail approximations, the asymptotic approximation performs better for the smaller value of the nominal size $\alpha$. The fact that size distortions are more severe with the larger $t_n = n^{-1/5}$ is likely a reflection of the fact that, for a fixed nominal size $\alpha$, the asymptotic approximations depend on $t_n$ being small relative to the support of $X_i$. In contrast, size distortions are minimal for $t_n = n^{-1/3}$ for most cases considered here.

Next, we examine the power of our test. We report Monte Carlo estimates of the power of our test for each design and parameters given by $(\bar\theta_1 + a, 0)$ for $a = .1, .2, .3, .4, .5$. To ensure that power is not driven by false rejection under the null, we use critical values based on Monte Carlo estimates of the finite sample exact least favorable distribution. We report power results for level .05 tests. Tables 2, 3 and 4 report the results. As expected, moving away from the identified set by a given amount generally leads to more power under the designs with smoother conditional means. In addition, the finding that the choice of the truncation parameter $t_n$ doesn't matter much as long as it is small enough appears to be borne out in the simulations (e.g. for Design 2, $t_n$ proportional to $n^{-1/3}$ is optimal, and this value of $t_n$ performs best, but choosing $t_n = n^{-1/2}$ gives close to the same power, while $t_n = n^{-1/5}$ gives much worse power).



6 Empirical Illustration 

We apply our methods to a median regression model with endogenously censored and missing data, using data from the Health and Retirement Study. The setup follows Section 9 of Armstrong (2011a), but we repeat it here for convenience. Letting $X_i$ and $W_i^*$ be yearly income and prescription drug expenditures for participant $i$ respectively, we posit the model

$$q_{1/2}(W_i^*|X_i) = \theta_1 + \theta_2 X_i \qquad (4)$$

where $q_{1/2}(W_i^*|X_i)$ is the median of $W_i^*$ conditional on $X_i$.

In this survey, participants who did not report a point value for prescription drug expenditures were given a series of brackets for this variable, resulting in interval censoring for a portion of the observations, and some observations with a completely missing outcome variable. In other words, we do not observe $W_i^*$, but only observe a random interval $[W_i^L, W_i^H]$ known to contain $W_i^*$. The data is censored in a way that is likely to violate a missingness at random or censoring at random assumption: the variable is censored only for those who do not recall how much they spent, and it is likely that remembering how much one spent is correlated with the level of spending itself.

This endogenous censoring problem makes it impossible to estimate $(\theta_1, \theta_2)$ consistently in general. We construct bounds using the conditional moment inequalities

$$E[m(X_i, W_i^L, W_i^H, \theta) \mid X_i] = E\left[\left.\begin{pmatrix} I(\theta_1 + \theta_2 X_i \le W_i^H) - 1/2 \\ 1/2 - I(\theta_1 + \theta_2 X_i \le W_i^L) \end{pmatrix}\,\right|\, X_i\right] \ge 0 \quad \text{a.s.} \qquad (5)$$

We test (5) at the .05 level using our methods for each value of $(\theta_1, \theta_2)$, and report a 95% confidence region that inverts these tests. The resulting confidence region contains the true parameter value with probability at least .95.
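The two-component moment function in (5) is simple to evaluate for a single observation. The sketch below is an illustration under the assumption that a completely missing outcome is coded as the interval $[-\infty, \infty]$; the function name is hypothetical.

```python
import math
from typing import Tuple


def moment(x: float, w_low: float, w_high: float,
           theta1: float, theta2: float) -> Tuple[float, float]:
    """Evaluate m(X_i, W_i^L, W_i^H, theta) from (5) for one observation.

    The first component uses the upper bound W^H, the second the lower
    bound W^L. Both have nonnegative conditional mean under the null.
    """
    fitted = theta1 + theta2 * x
    upper = (1.0 if fitted <= w_high else 0.0) - 0.5
    lower = 0.5 - (1.0 if fitted <= w_low else 0.0)
    return upper, lower


if __name__ == "__main__":
    # Point-valued outcome: both bounds coincide.
    print(moment(1.0, 2.0, 2.0, 0.0, 1.0))
    # Completely missing outcome: the interval is the whole real line.
    print(moment(1.0, -math.inf, math.inf, 0.0, 1.0))
```

A completely missing outcome contributes $1/2$ to both components, so it never provides evidence against any parameter value, which is how the inequalities remain valid without a missingness at random assumption.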

We restrict our sample to the 1996 wave of the survey and women with no more than \$15,000 of yearly income who report using prescription medications. The data set also contains observations with a censored covariate (income), but, for illustrative purposes, we focus on endogenous censoring of the outcome variable and throw away observations where income is missing or censored (this is valid if remembering prescription drug expenditures is not correlated with income, but may be correlated with spending itself). Our data set has 636 observations, of which 54 have an interval censored outcome variable, and an additional 7 have a completely missing outcome variable. See Armstrong (2011a) for additional details about the data set. For the truncation parameter $t_n$, we use $n^{-1/3} \cdot [\max_{1 \le i \le n} X_i - \min_{1 \le i \le n} X_i]$. The $n^{-1/3}$ scaling results in a test statistic that is rate adaptive to smoothness between Lipschitz continuity and 2 derivatives of the conditional truncation probabilities (a smaller value could be used to adapt to a less smooth data generating process). For the critical value for our test, we use the analytically computed critical value defined in (2).
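The truncation parameter used here is a one-line computation: the sample range of the covariate scaled by $n^{-1/3}$. A minimal sketch (the function name is hypothetical):

```python
from typing import Sequence


def truncation_parameter(xs: Sequence[float]) -> float:
    """t_n = n^(-1/3) * (max_i X_i - min_i X_i) for a scalar covariate."""
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one observation")
    return n ** (-1.0 / 3.0) * (max(xs) - min(xs))
```

With $n = 636$ observations the scaling factor $n^{-1/3}$ is roughly 0.116, so the smallest interval considered by the statistic covers a little under 12% of the observed income range.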

Figure 1 shows the resulting confidence region. For comparison, Figures 2 and 3 show confidence regions using the tests proposed in Armstrong (2011a) and Andrews and Shi (2009) respectively, taken directly from Armstrong (2011a). The test considered in this paper can be thought of as introducing an optimal weighting to the Andrews and Shi (2009) statistic that improves the rate for local alternatives from $n^{-\gamma/(2d_X+2\gamma)}$ to the $(n/\log n)^{-\gamma/(d_X+2\gamma)}$ rate obtained in Theorem 4.1 in the set identified case, while reducing the rate by a $\log n$ term in the point identified case. The Armstrong (2011a) test yields a slightly better improvement in power, but is not robust to failure of certain smoothness conditions. We also report confidence regions for each component of $(\theta_1, \theta_2)$, formed by projecting the confidence region onto each component. Table 5 reports these confidence intervals, along with the corresponding confidence intervals formed using other methods reproduced from Armstrong (2011a) for convenience.
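Test inversion and projection can be sketched on a parameter grid as follows. The `rejects` callback stands in for the test of (5) at a given parameter value and is an assumption of this illustration, as are the function names and the use of a finite grid.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Point = Tuple[float, float]
Interval = Tuple[float, float]


def confidence_region(grid1: Sequence[float], grid2: Sequence[float],
                      rejects: Callable[[float, float], bool]) -> List[Point]:
    """Collect every grid point (theta1, theta2) that the test fails to reject."""
    return [(a, b) for a in grid1 for b in grid2 if not rejects(a, b)]


def project(region: List[Point]) -> Optional[Tuple[Interval, Interval]]:
    """Project the region onto each component as (min, max) intervals.

    Returns None for an empty region. Taking min and max reports the
    interval hull, so gaps in a disconnected region are not shown.
    """
    if not region:
        return None
    firsts = [p[0] for p in region]
    seconds = [p[1] for p in region]
    return (min(firsts), max(firsts)), (min(seconds), max(seconds))


if __name__ == "__main__":
    grid = [-2, -1, 0, 1, 2]
    # Toy rejection rule used only to exercise the code.
    region = confidence_region(grid, grid, lambda a, b: a * a + b * b > 1)
    print(project(region))
```

Because each point of the joint region is covered with the stated probability, the projected intervals are valid, if conservative, for the individual components.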

The slope parameter, $\theta_2$, gives the median increase in yearly prescription drug spending associated with an increase in income. Thus, according to the results using the test proposed in this paper, a 95% confidence interval puts the median increase in prescription drug expenditures associated with a \$1,000 increase in income between \$5.30 and \$32.00. It is worth making a few notes in comparing this with the confidence regions using the unweighted statistic. As predicted by the asymptotic power results, the confidence region for the slope parameter is tighter than the one obtained using an unweighted test statistic with a critical value formed using subsampling with a conservative rate. The unweighted statistic gives a better lower bound for the slope parameter when subsampling with an estimated rate is used to form the critical value, but this test is less robust in the sense that it relies on additional smoothness conditions.

Comparing the joint confidence regions for $(\theta_1, \theta_2)$, we see that the tests based on unweighted statistics with subsampling based critical values lead to disconnected regions of rejected and accepted parameter values. While the test based on a conservative rate proposed in Andrews and Shi (2009) has only a small island of rejected parameter values in the confidence region, the test based on an estimated rate proposed in Armstrong (2011a) leads to numerous isolated areas in the confidence region. In contrast, our test leads to a connected confidence region. A likely explanation for this phenomenon is that the subsampling based confidence regions use critical values that implicitly estimate where the data generating process is in the null space. This leads to disconnected confidence regions when, as the parameter moves in some direction, the test first begins to reject as the test statistic increases, but then fails to reject when the critical value increases as well. In contrast, our test uses a least favorable critical value, so the test always moves from acceptance to rejection as the test statistic increases.



7 Conclusion 

This paper considers inference in conditional moment inequality models using a multiscale 
statistic. The asymptotic distribution of our test statistic is derived, and the results are used 
to obtain feasible critical values. The test is shown to obtain certain optimal rates for power 
against local alternatives adaptively, and is the only feasible test available that does so for 
the best possible range of smoothness classes. Our results also have implications for the 
effect of moment selection procedures on power, and our test has the additional advantage of 
being adaptive without requiring such tests. An empirical application to a regression model 
with endogenous censoring and missing data illustrates the power improvement from the 
test.



A Uniformity in the Underlying Distribution



We prove a stronger version of Theorem 2.1 that holds uniformly in certain classes $\mathcal{P}$ of underlying distributions for which Assumption 2.1 holds uniformly over $P \in \mathcal{P}$. To state and prove this result, we introduce some notation for indexing certain quantities by the underlying distribution $P$. We use the notation $E_P$ to denote expectation with respect to the probability distribution $P$, and use similar notation for conditional expectations and conditional and unconditional variances, covariances and correlations. We make explicit the dependence of the identified set on $P$ and define $\Theta_0(P) = \{\theta \in \Theta \mid E_P[m(W_i, \theta) \mid X_i] \ge 0 \text{ a.s.}\}$.

In the following theorem, the conditional distribution (including the conditional mean) of $m(W_i, \theta)$ given $X_i = x$ is allowed to vary over $\mathcal{P}$. In particular, since no conditions are placed on the conditional mean of distributions in $\mathcal{P}$, the result shows that tests based on this asymptotic distribution result control the asymptotic size uniformly over distributions for which the conditional mean can be nonsmooth in arbitrary ways, although there are some mild continuity assumptions on the conditional variance. We do, however, impose the same distribution of $X_i$ for all $P \in \mathcal{P}$. This is mostly to avoid introducing additional notation in the proof, and could be relaxed.

Theorem A.1. Let $c_n$, $a(c_n)$ and $b(c_n)$ be defined as in Theorem 2.1. Suppose that Assumption 2.1 holds for the same constants in part (a) for all $P \in \mathcal{P}$ and with the continuity in part (ii) of part (a) uniform over $P \in \mathcal{P}$. Then, for any vector $r \in \mathbb{R}^{d_Y}$,

$$\liminf_n \inf_{P \in \mathcal{P},\, \theta_0 \in \Theta_0(P)} P(a(c_n) T_n(\theta_0) - b(c_n) \le r) \ge P(Z \le r)$$

where $Z$ is a $d_Y$ dimensional vector of independent standard type I extreme value random variables. If, in addition, $E_P[m(W_i, \theta_0) \mid X_i] = 0$ for all $P \in \mathcal{P}$ for some $\theta_0$, then, for this $\theta_0$,

$$a(c_n) T_n(\theta_0) - b(c_n) \xrightarrow{d} Z$$

uniformly over $P \in \mathcal{P}$.
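The limiting law in Theorem A.1 has a closed form: a standard type I extreme value (Gumbel) variable has distribution function $\exp(-e^{-r})$, and independence makes the joint distribution function a product. The helper names below are hypothetical; the sketch only evaluates these standard formulas.

```python
import math
from typing import Sequence


def gumbel_cdf(r: float) -> float:
    """P(Z <= r) for a standard type I extreme value variable."""
    return math.exp(-math.exp(-r))


def gumbel_quantile(prob: float) -> float:
    """Inverse of gumbel_cdf for 0 < prob < 1."""
    return -math.log(-math.log(prob))


def joint_cdf(r_vec: Sequence[float]) -> float:
    """P(Z <= r) elementwise for independent standard Gumbel components."""
    return math.prod(gumbel_cdf(r) for r in r_vec)


if __name__ == "__main__":
    print(gumbel_quantile(0.95))  # upper 5% point of a single component
    print(joint_cdf([0.0, 0.0]))
```

These quantiles are what make an analytic critical value possible once the normalizing sequences $a(c_n)$ and $b(c_n)$ are known.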

For completeness, we also include the following theorem, which states that the tests proposed in this paper control the size uniformly over classes of distributions that satisfy the conditions of the above theorem.

Theorem A.2. For any class $\mathcal{P}$ of distributions satisfying the conditions of Theorem A.1,

$$\limsup_n \sup_{P \in \mathcal{P},\, \theta_0 \in \Theta_0(P)} P(S_n(\theta_0) > q_{1-\alpha}) \le \alpha,$$

and the same holds with $q_{1-\alpha}$ replaced by $q_{1-\alpha,\text{sim}}$ as long as the sequence of bootstrap conditional variance matrices is continuous uniformly over $x$ and has correlation coefficients bounded away from one with probability one with the continuity and bounds on the correlation coefficients uniform over $P \in \mathcal{P}$.

Theorem A.2 follows immediately from Theorem A.1 (for the validity of the critical value $q_{1-\alpha,\text{sim}}$, the result follows from applying Theorem A.1 to the sequence of bootstrap processes). We prove Theorem A.1 in the next appendix.



B Proof of Theorem A.1



We first prove a version of Theorem A.1 where the $X_i$s are deterministic and $\hat\sigma^2$ is replaced by a certain sample average of conditional variances defined below. The result then follows from showing that the conditions of this result hold almost surely conditional on $\{X_i\}_{i=1}^{\infty}$, and that replacing the sample average of conditional variances with $\hat\sigma^2$ does not change the test statistic too much.

Throughout this section, we fix $\theta$ and let $Y_i = m(W_i, \theta)$, and drop the $\theta$ notation elsewhere such as in the definition of $\sigma_{n,j}(s,t,\theta)$. We prove the following result with $\{X_i\}_{i=1}^{\infty}$ replaced by a deterministic sequence $\{x_i\}_{i=1}^{\infty}$. We consider a set $\mathcal{P}$ determining the probability distribution of $Y_i$ for a given $x_i$.

Let $\mathcal{F} = \{F_{x,P} : x \in \mathcal{X}, P \in \mathcal{P}\}$ be a family of $d_Y$-dimensional distribution functions, with $\mathcal{X}$ a compact, Jordan measurable subset of $\mathbb{R}^{d_X}$ such that $\text{vol}(\mathcal{X}) > 0$, that is, it has positive $d_X$ dimensional volume. Consider $(x_1, Y_1), (x_2, Y_2), \ldots$ with $x_i$ deterministic and $Y_i \sim F_{x_i,P}$ independent. Define $\mu_P(x) = E_{x,P} Y_i$ and $\Sigma_P(x) = \text{Cov}_{x,P}(Y_i)$, where the subscript $x, P$ denotes with respect to $Y_i \sim F_{x,P}$. We use the notation $z_{i,j}$ to denote the $j$th coordinate of the $i$th observation or element in a sequence $\{z_i\}$. Let $I(s,t) = \prod_{j=1}^{d_X} [s_j, s_j + t_j)$. We abuse notation slightly and define $\text{vol}(t) = \prod_{j=1}^{d_X} t_j$ for a vector $t$. Let $J_n(s,t) = \{i : 1 \le i \le n, x_i \in I(s,t)\}$. We consider the following regularity conditions.

Assumption B.1.

a.) There exists $\lambda > 0$ and $M_\lambda < \infty$ such that

$$E_{x,P}(e^{\lambda |Y_{i,j}|}) \le M_\lambda \text{ for all } x \in \mathcal{X},\ 1 \le j \le d_Y,\ P \in \mathcal{P}.$$

Hence the characteristic function of $Y_{i,j}$ is analytic on $(-\lambda, \lambda)$ for all $j$ and $Y_i \sim F_{x,P}$, $x \in \mathcal{X}$, $P \in \mathcal{P}$.

b.) $\sigma_{j,P}(x) = \Sigma_{jj,P}(x)^{1/2}$ is continuous and positive on $\mathcal{X}$ for all $1 \le j \le d_Y$ uniformly over $P \in \mathcal{P}$.

c. ) (for d Y > I): 

p = sup sup sup —- — — < 1 . 

Pev %+j x&x a it p{x)a jt p(x) 

Assumption B.2. There exists a continuous, positive and bounded density function $f$ on $\mathcal{X}$ and a sequence $t_n \to 0$ such that

a.) $n t_n^{d_X} / |\log t_n|^4 \to \infty$;

b.) for any $\delta > 0$, $\#J_n(s,t) \sim n \int_{I(s,t)} f(x)\,dx$ uniformly over $I(s,t) \subset \mathcal{X}$ such that $\text{vol}(t) \ge \delta t_n^{d_X} / |\log t_n|^2$.

Define $\sigma_{n,j}(s,t) = \{\sum_{i \in J_n(s,t)} \sigma_{j,P}^2(x_i)\}^{1/2}$ and let

$$T_{n,j} = -\inf_{I(s,t) \subset \mathcal{X},\, t \ge t_n \mathbf{1}} \sum_{i \in J_n(s,t)} Y_{i,j} / \sigma_{n,j}(s,t)$$

(we suppress the dependence of $\sigma_{n,j}(s,t)$ and $T_{n,j}$ on $P$ for notational convenience).
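For intuition, the statistic can be approximated in the scalar-covariate case by searching over intervals on a finite grid. This is a rough sketch under several assumptions of ours: $d_X = 1$, half-open intervals $[s, s+t)$, interval endpoints restricted to an evenly spaced grid, and the conditional variances $\sigma^2(x_i)$ supplied by the caller (in practice they would be estimated). The function names and the `grid` parameter are hypothetical.

```python
import math
from typing import List, Sequence


def index_set(x: Sequence[float], s: float, t: float) -> List[int]:
    """J_n(s, t): indices i with x_i in the interval [s, s + t)."""
    return [i for i, xi in enumerate(x) if s <= xi < s + t]


def multiscale_statistic(x: Sequence[float], y: Sequence[float],
                         sigma2: Sequence[float], t_n: float,
                         lower: float, upper: float, grid: int = 50) -> float:
    """Grid approximation to -inf_{s,t} sum_{i in J_n(s,t)} y_i / sigma_n(s,t).

    Intervals run between grid points of [lower, upper] and must have
    length at least t_n. Returns -inf if no admissible interval contains data.
    """
    step = (upper - lower) / grid
    best = -math.inf
    for a in range(grid):
        s = lower + a * step
        for b in range(a + 1, grid + 1):
            t = (lower + b * step) - s
            if t < t_n - 1e-12:  # small tolerance for floating point error
                continue
            idx = index_set(x, s, t)
            if not idx:
                continue
            sd = math.sqrt(sum(sigma2[i] for i in idx))
            value = -sum(y[i] for i in idx) / sd
            best = max(best, value)
    return best
```

Large positive values arise when some interval contains observations whose outcomes are predominantly negative, which is evidence against the inequality $\mu_P(x) \ge 0$ on that interval.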

Theorem B.1. Suppose that $\mu_P(x) \ge 0$ for all $x \in \mathcal{X}$, $P \in \mathcal{P}$ and that Assumptions B.1 and B.2 hold. Let $a_n = (2 \log t_n^{-d_X})^{1/2}$ and $b_n = 2 \log t_n^{-d_X} + (2 d_X - \tfrac{1}{2}) \log\log t_n^{-d_X} - \log[2\sqrt{\pi}/\text{vol}(\mathcal{X})]$. Then, for any vector $r \in \mathbb{R}^{d_Y}$,

$$\liminf_{n \to \infty} \inf_{P \in \mathcal{P}} P(a_n T_n - b_n \le r) \ge P(Z \le r)$$

where $Z$ is a $d_Y$ dimensional vector of independent standard type I extreme value random variables. If, in addition, $\mu_P(x) = 0$ for all $x \in \mathcal{X}$, $P \in \mathcal{P}$, then

$$\lim_{n \to \infty} \sup_{P \in \mathcal{P}} \left| P(a_n T_n - b_n \le r) - P(Z \le r) \right| = 0.$$

Theorem A.1 follows from Theorem B.1 and the following lemmas.

Lemma B.1. Under Assumption 2.1, part (b) of Assumption B.2 above holds for almost all sequences $\{X_i\}_{i=1}^{\infty}$.
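The normalizing sequences in Theorem B.1 are explicit, so a critical value for a scalar statistic can be sketched directly. We read the constants as $a_n = (2\log t_n^{-d_X})^{1/2}$ and $b_n = 2\log t_n^{-d_X} + (2d_X - \tfrac12)\log\log t_n^{-d_X} - \log[2\sqrt{\pi}/\text{vol}(\mathcal{X})]$; the text is damaged at this point, so this reading, the function names, and the rule "reject when $T_n > (b_n + q)/a_n$ with $q$ the Gumbel quantile" should all be treated as assumptions for the univariate case rather than the paper's definition.

```python
import math
from typing import Tuple


def normalizing_constants(t_n: float, d_x: int, vol_x: float) -> Tuple[float, float]:
    """Return (a_n, b_n) as read from Theorem B.1.

    Requires 0 < t_n < 1 with d_x * |log t_n| > 1 so both logs are positive.
    """
    log_term = -d_x * math.log(t_n)  # equals log(t_n ** (-d_x))
    a_n = math.sqrt(2.0 * log_term)
    b_n = (2.0 * log_term
           + (2.0 * d_x - 0.5) * math.log(log_term)
           - math.log(2.0 * math.sqrt(math.pi) / vol_x))
    return a_n, b_n


def analytic_critical_value(t_n: float, d_x: int, vol_x: float, alpha: float) -> float:
    """Threshold c with P(a_n T_n - b_n > a_n c - b_n) approx alpha in the limit."""
    a_n, b_n = normalizing_constants(t_n, d_x, vol_x)
    gumbel_q = -math.log(-math.log(1.0 - alpha))
    return (b_n + gumbel_q) / a_n
```

The threshold grows like $(2\log t_n^{-d_X})^{1/2}$, which reflects the price paid for searching over all locations and bandwidths.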

Proof. We have 



n Il(s,t)f( X ) dx 



E n I(s <Xi<s + t)- EI(s <Xi<s + t) 
EI(s <X { < s + t) ' 



logt„ 



2 > 
□ 



This converges to one uniformly over $(s,t)$ with $\text{vol}(t) \ge K_n (\log n)/n$ for any sequence $K_n \to \infty$ by Theorem 37 in Chapter 2 of Pollard (1984), and the conditions $n t_n^{d_X}/|\log t_n|^4 \to \infty$ and $\text{vol}(t) \ge \delta t_n^{d_X}/|\log t_n|^2$ guarantee that $\text{vol}(t) \ge \delta t_n^{d_X}/|\log t_n|^2 \ge K_n n^{-1} |\log t_n|^2 \ge K_n (\log n)/n$ for some $K_n \to \infty$.

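Lemma B.2. Under Assumption 2.1, $\text{vol}(\hat{\mathcal{X}}) \xrightarrow{p} \text{vol}(\mathcal{X})$.

Proof. For a given $\varepsilon, \delta > 0$, the following event will hold with probability approaching one: for every point $\varepsilon k$ in the grid $(\varepsilon \mathbb{Z}^{d_X}) \cap \mathcal{X}$, at least one observation $X_i$ will have each component $X_{i,j}$ within $\delta$ of $\varepsilon k$. Once this holds, the set $\varepsilon I((k_1 + \delta, \ldots, k_{d_X} + \delta), (1 - \delta, \ldots, 1 - \delta))$ will be contained in the convex hull of the $X_i$s for all $k$ such that $\varepsilon I(k, \mathbf{1}) \subset \mathcal{X}$. This gives a lower bound of $(1 - 2\delta)^{d_X} \text{vol}(\cup_{\varepsilon I(k,\mathbf{1}) \subset \mathcal{X}}\, \varepsilon I(k, \mathbf{1}))$ for the volume of the convex hull of the $X_i$s, which can be made arbitrarily close to $\text{vol}(\mathcal{X})$ by Jordan measurability. The result follows from this and the upper bound $\text{vol}(\hat{\mathcal{X}}) \le \text{vol}(\mathcal{X})$. □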

Lemma B.3. Under Assumptions \B.1\ and \B>.2\ (with the Xi's treated as nonrandom), 
sup s s+teX t>t„ J£& ^(st) ~ 1 — opQog^) -1 uniformly over P G V and, if m(6,x) = for 



all x, sup SiS+te ^ 



t>t n 



1 



Op(logn) 1 uniformly over P G V. 



Proof. First, note that, since x i— > 1/x 2 is decreasing and differentiate at one, it suffices to 



show that inf 
Note that 



s,s+teX <r* .(s,t) 



1 > —op (log n) 1 and sup s 



s+t&X 



nd*As,t) 



- 1 



o P (log n) 



-i 



t) - o- 2 n j (s, t)/n = - Yl 

ieJ n (s,i) 



Y 2 . — 

%,3 



- E Y « 

ieJ„(s,t) 



i £ ^,p(x) 2 = / + 



77 



where 



'4 E K-^) 
and



i€Jn(s,t) 



- E 

n 

IS Jn(s,t) 



"I 2 



i E [3* y «] 



iS Jn(s,t) 



- E ^ 

iS Jn(s,t) 



We first bound $I/[\sigma_{n,j}^2(s,t)/n]$ where $I$ is given above. Let $W_i = Y_{i,j}^2 - E_{x_i,P} Y_{i,j}^2$. Note that $\sigma_{n,j}^2(s,t)$ is bounded from below by a constant times $\#J_n(s,t)$ uniformly over $P \in \mathcal{P}$, so it suffices to consider $\sum_{i \in J_n(s,t)} W_i / \#J_n(s,t)$. For some sequence $K_n$, let $\tilde W_i = W_i I(|W_i| \le K_n)$ be a truncated version of $W_i$. Note that, by Markov's inequality, for $\lambda > 0$ given in Assumption B.1,

$$P(|W_i| > K) \le E_{x_i,P} \exp(\lambda \sqrt{|W_i|} - \lambda \sqrt{K})$$

so

$$P(|W_i| > K \text{ some } 1 \le i \le n) \le n \exp(-\lambda \sqrt{K}) \sup_{x \in \mathcal{X},\, P \in \mathcal{P}} E_{x,P} \exp(\lambda \sqrt{|W_i|}),$$

which goes to zero for any $K = K_n$ that increases faster than $(\log n)^2$. To bound $|E_{x_i,P} \tilde W_i - E_{x_i,P} W_i|$, note that

$$\{E_{x_i,P}[|W_i| I(|W_i| > K)]\}^2 \le E_{x_i,P}(W_i^2) P(|W_i| > K) \le C \exp(-\lambda \sqrt{K})$$

for some constant $C$ that does not depend on $P$ or $x_i$. Thus, $|\sum_{i \in J_n(s,t)} E_{x_i,P} \tilde W_i| / \#J_n(s,t) \le [C \exp(-\lambda \sqrt{K_n})]^{1/2}$, which goes to zero at a polynomial rate for $K_n$ increasing faster than $(\log n)^2$, which is faster than the required $\log n$ rate.

Using the fact that the supremum over $(s,t)$ is determined by the maximum over no more than $n^{2d_X}$ possible deterministic configurations for $J_n(s,t)$, and that for any $\delta > 0$, $\delta (\log n)^{-1} \ge \#J_n(s,t)^{-1/4}$ for large enough $n$,



P I sup 

.s,s+tex,t>t n 



i&J n (s,t) 



[Wi - E XitP Wi] 



< n 2dx sup P 

s,s+t£X,t>t n 



> 5 (logn) 1 

>#J„M)~ 1/4 



Now, using Bernstein's inequality, for $C$ a bound for the fourth moment of $Y_{i,j}$, the above display is bounded by



" S)S+ S>*„ 6XP V C#J n (s,t) + K n [#J n (s,tr/*]/3j- 

Let $K_n$ be such that $K_n \le \#J_n(s,t)^{1/2}$ for all $(s,t)$ and $K_n/(\log n)^2 \to \infty$. For large enough $n$, this gives a bound in the above display of $n^{2d_X} \sup_{s, s+t \in \mathcal{X},\, t \ge t_n} \exp(-\#J_n(s,t)^{1/4}) \to 0$. As for $II$, we have



n 2 



n 2 



II > 



> -2 




ieJ n (s,t) 



and similar methods show that the last line divided by a n j(s,t)/n converges to zero at a 
faster than logn rate uniformly over (s,t) with s, s + t G X, t > t n . If rhj(9,x) = for 



all x, then II = — 



V 



and 



[cr n j(s, t)/n) also converges to 



zero at a faster than logn rate uniformly over (s,t) with s, s + t G Af, t > t n by similar 
arguments. □ 



B.1 Proof of Theorem B.1



We begin by proving the result in the case of a univariate outcome $Y_i = m(W_i, \theta)$. Section B.4 generalizes the result to the case of multivariate $Y_i$.

To simplify notation, we let $d = d_X$ and we omit the subscript $P$ when dealing with expectations and other quantities that depend on the underlying distribution $P$. Let $Z_i = \mu(x_i) - Y_i$ and let $B_n = \{(s,t) : I(s,t) \subset \mathcal{X}, t \ge t_n \mathbf{1}\}$. Define

$$\mathbb{H}_n(s,t) = \frac{\sum_{i \in J_n(s,t)} Z_i}{\sigma_n(s,t)} \text{ for } (s,t) \in B_n. \qquad (6)$$

Let c



Note in particular that 



t- d (c 2 /2) 2d -h~ c2 / 2 

rf/.., imw * ( lL„, ,M/ 2 log[(ci|logt n |) 2 ^Wol(A')e c /2v / 7rll 2 
~ CM logg)--. exp ( - -{(2rf| log(„|)^ + g ( M|]og<,|)^ J 

->■ [2 v / 7r/vol(A')]e _c as n -»■ oo. 



(7) 



Theorem IB.ll in the case of univariate Y follows from 



lim sup 



P{ sup H n (s,t) > c} - [1 -exp(-e _c )] 

(s,t)eB n 



->■ for all C G 



Consider a change of variables by defining $X_c$ such that

$$X_c(-u, v) = \mathbb{H}_n(u t_n, (v - u) t_n) \text{ for } (u t_n, (v - u) t_n) \in B_n. \qquad (9)$$

The domain of $X_c$ is thus $D_c = \{(-u, v) \in (-t_n^{-1}\mathcal{X}) \times t_n^{-1}\mathcal{X} : v - u \ge \mathbf{1}\}$. Note that $X_c(-u, v)$ is a normalized sum over observations for which $x_i$ lies in the rectangle $\{x \mid u t_n \le x < v t_n\}$. The change of variable and unusual notation are designed so that, for $a, b \ge 0$, the rectangle associated with $X_c(-u + a, v + b)$ contains the rectangle associated with $X_c(-u, v)$. This helps with the verification of some of the conditions in Chan and Lai (2006) involving positive increments of the process.

Let $\psi(z) = \frac{1}{\sqrt{2\pi}\, z} e^{-z^2/2}$ and $\Delta_c = (2c^2)^{-1}$. Consider a restriction of $D_c$ to $D_L\ (= D_{c,L}) = \{(-u, v) \in D_c : v - u \le L\mathbf{1}\}$ for some $L > 1$. Let

$$D_w^L = \{(-u, v) \in -I(w, |\log t_n|) \times I(w, |\log t_n|) : \mathbf{1} \le (v - u) \le L\mathbf{1}\}. \qquad (10)$$

We will show that regularity conditions (C) and (A1)-(A5) in Corollary 2.7 of Chan and Lai (2006) are satisfied uniformly on the domains $D_L$ and over $P \in \mathcal{P}$ and hence

$$q_{w,P} = P\{\sup_{(-u,v) \in D_w^L} X_c(-u, v) > c\} \sim \psi(c) \Delta_c^{-2d} \int_{D_w^L} H(-u, v)\, d(-u, v) \qquad (11)$$

uniformly over $I(w, |\log t_n|) \subset t_n^{-1}\mathcal{X}$ and $P \in \mathcal{P}$, where $H$ is defined in that paper and, as shown below, takes the form

$$H(-u, v) = 4^{-2d}\, \text{vol}(v - u)^{-2} \qquad (12)$$

in our case. Conditions (C) and (A1)-(A2) of Chan and Lai (2006) are verified in Section B.2 and conditions (A3)-(A5) are verified in Section B.5.

We partition $t_n^{-1}\mathcal{X}$ into cubes of length $|\log t_n|$ and apply (11) on each cube to show (8). More specifically, define $Q_n = \{w \in (|\log t_n| \mathbb{Z})^d : I(w, |\log t_n|) \subset t_n^{-1}\mathcal{X}\}$. Since $\mathcal{X}$ is Jordan measurable and $t_n |\log t_n| \to 0$,

$$\#Q_n \sim \text{vol}(\mathcal{X}) / (t_n |\log t_n|)^d, \qquad (13)$$

and it follows from (7), (11) and (12) that

Q w ,p ->• A = (1 — L~ 1 ) d e~'° (14) 

Ul£Qn 

uniformly over $P \in \mathcal{P}$, noting that $A$ is the limit of $\psi(c) \Delta_c^{-2d} (\#Q_n) |\log t_n|^d \int_{\mathbf{1} \le t \le L\mathbf{1}} 4^{-2d}\, \text{vol}(t)^{-2}\, dt$. Since $X_c$ is independent over $D_{w_1}^L$ and $D_{w_2}^L$ for $w_1, w_2 \in Q_n$, $w_1 \ne w_2$, it follows from the Poisson limit of the Binomial distribution that

$$P\{\sup_{w \in Q_n} \sup_{(-u,v) \in D_w^L} X_c(-u, v) > c\} \to 1 - e^{-A}$$

uniformly over $P \in \mathcal{P}$. Hence to show (8), it suffices for us to prove the following:
Lemma B.4. (a) For all e > 0, there exists L large enough such that 

Pi = sup P{ sup X c (— u, v ) > c} < e for all large c. 

PeV (-u,v)£D c \D c>L 

(b) P2 = SUp P6 p ^tox^eQn,^^ ^{ SU P«6/(wiJlog^ > 

c} ->■ 0. 

(c) p 3 = sup PeP P{sup ( _ Uit>)6i , t \ Uiii6QBJK | ]ogtn [ ) X c (-u,T;) > c} -> 0. 



We prove this lemma in Section B.3. Sections B.2 and B.5 verify the conditions of Chan and Lai (2006) for the tail approximations used in the above argument. Section B.4 extends the results to multivariate $Y_i$.



B.2 On (12) and the Verification of (C), (A1) and (A2)



Let $\Phi$ be the c.d.f. of the standard normal.

Lemma B.5. (a) Let $S_n = U_1 + \cdots + U_n$ and $s_n^2 = \text{Var}(S_n)$. Assume that $U_1, \ldots, U_n$ are independent mean 0 random variables and there exists $\lambda > 0$, $M_\lambda < \infty$ and $\sigma_0^2 > 0$ such that

$$\text{(i) } E(e^{\lambda |U_k|}) \le M_\lambda, \qquad \text{(ii) } \text{Var}(U_k) \ge \sigma_0^2, \qquad 1 \le k \le n.$$

Let $1 \le x_n = o(n^{1/6})$. Then there exists a constant $C > 0$ dependent only on $\lambda$, $M_\lambda$, and $\{x_n\}_{n \ge 1}$ such that

$$\left| \frac{P\{S_n > x s_n\}}{1 - \Phi(x)} - 1 \right| \le \frac{C x^3}{\sqrt{n}} \text{ for all } 1 \le x \le x_n,\ n \ge 1.$$

(b) [(A1) of Chan and Lai (2006)] $P\{\mathbb{H}_n(s,t) > c - y/c\} \sim \psi(c - y/c)\ [\sim 1 - \Phi(c - y/c)]$ uniformly over $P \in \mathcal{P}$ and positive, bounded values of $y$ and $(s,t) \in B_n$.



Feller 



Proof. The special case of i.i.d. Uk in (a) reduces t o Theorem 1 in Chapter 16.6 of 
( 1971 ). Theorem 3 in Chapter 16.7 of Feller f 197lh extends Theorem 1 to non-identically 
distributed random variables such that E(\Uk\ 3 ) / E(U%) are uniformly bounded, with a 
O(j-) instead of a error bound. We follow step-by-step the proof of Feller's Theorem 3, 
using the additional condition Var(f4) > o"q to obtain the ^= error bound in (a). 

Under Assumption IB. 11 H n (s,t) = - s * s ^ , where S* is a sum of independent mean 
random variables satisfying (i) and (ii) with the bounds uniform over P EV and Var(S'*) = 
a 2 (s,t). Hence by (a), 



P{M n (s,t)>c-y/c} „( {c-y/cf \ 

-± + u[ ; — — — ) as c — y/c — y oo 



1 - $(c-y/c) 



a n (s,t) 



(15) 



uniformly over P 6? and (s,t) E B n . Since vo\(I(s,t)) > t d n for (s,t) E B n , by Assumption 
EZTb), 

liminff inf #J n (s,t)]/{nt d n ) > inf f(x) > 0. 

n->oo (s,t)eB n x£X 

Hence by Assumption IB. 1( b) and IB. 2( a). a^ 2 (s,t) = 0((nt^) _1 ) = o(|logt n |~ 4 ) uniformly 
over (s,t) E B n and P EV. Since c = 0{\ logt n | 1/2 ), (b) follows from (TTS}. □ 
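Part (b) treats the normal upper tail $1 - \Phi(c)$ and the function $\psi(c)$ as interchangeable for large $c$. Assuming $\psi(z) = e^{-z^2/2}/(\sqrt{2\pi}\, z)$, which is the classical Mills ratio approximation and our reading of the garbled definition, the equivalence can be checked numerically with the standard library. The function names are hypothetical.

```python
import math


def normal_upper_tail(c: float) -> float:
    """1 - Phi(c), computed via the complementary error function."""
    return 0.5 * math.erfc(c / math.sqrt(2.0))


def psi(c: float) -> float:
    """Assumed form psi(c) = exp(-c^2 / 2) / (sqrt(2 pi) c), for c > 0."""
    return math.exp(-c * c / 2.0) / (math.sqrt(2.0 * math.pi) * c)


if __name__ == "__main__":
    for c in (2.0, 5.0, 10.0):
        # The ratio approaches one from below as c grows.
        print(c, normal_upper_tail(c) / psi(c))
```

The ratio behaves like $1 - 1/c^2$, so the two tails agree to first order once $c$ is of the order $|\log t_n|^{1/2}$ and $t_n$ is small.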



Let $\rho_c(-u, v, -u_1, v_1) = \text{Cov}(X_c(-u, v), X_c(-u_1, v_1))$ (we suppress the dependence of $\rho_c$ on $P$ in the notation) and let $\{W_{-u,v}(q, r) : (q, r) \in [0, \infty)^{2d}\}$ be a continuous Gaussian random field satisfying



4- 

W- VtV (0) = 0, E[W- u , v (q,r)] = -J2 j j v 

4{Vj Uj) 

Cov(^fe r ),W-„,„(a,«) =f minfe ;7 ) + n "° feft) . 



(16) 



Lemma B. 6. (a) [(C) of IChan and Lail (j2006h ] 1 - p e (-u, v, -u + 5 u ,v + 6 V ) ~ Yfj=i 2(4-tjj 



uniformly over (—u, v) G -Dl and P &V and compact sets of (S u , S v )/ A c > 0. 



(b) [(A2) of IChan and Lail (120061 )] For any a > and positive integer m, as c —> oo, 



{c[X c (— u + ak u A c , v + ak v A c ) — X c (— it, u)] : < < ml}|X c (— u, v) = c — y/c 

4- {W_ UjU (a£; u , oA^,) : < (k u , k v ) < ml}, 

uniformly over (—u,v) G Dl and P G V and positive bounded values of y. 

(c) H(-u,v) = linix^oo J °° e 2/ P{sup < (?ir )< K1 W- U>v (q,r) > y}dy has the closed-form 
given in f|T2|) . 



Proof. Let = (s,t), where s = ut n , t = (v — u)t n and zs = (s — S u t n ,t + (5 V + S u )t n ) for 



some 8 U , 5 V > 0. Then 



p c (-u,v, -u + 5 u ,v + 5 V ) = a n (z ) / a n (z 5 ) 



(17) 



1 + 



Since A c ~ (4| logtnl)" 1 , by Assumption IB. 21 



^n(^) -^»(^) V 1/2 = 1 _ gj[(fg) - ^n(^o) n ( ° 2 n( Z 6) ~ ^ni^ V 



a 2 n (z ) ~ na 2 ( S )/( S )vol(t), [a 2 n (z s ) - a 2 n (z )} ~ na 2 ( S )/(s)vol(t) ^ 



5=1 



and (a) follows from substituting ( !T8|) into (JTTJ) . 
Let a > and let 5 U = ak u A c , 6 V = ak v A c . Then 

c[X c (-u + 5 u ,v + 5 v )- X c (-u, v)} = c{ ^ J ^s)\Mzo) Z i + y) 



(19) 



We note here that as $\delta_u, \delta_v \ge 0$, so $J_n(z_\delta) \supset J_n(z_0)$. By (16)-(18), conditioned on $X_c(-u, v) = c - y/c$ and noting that $c^2 \Delta_c = \tfrac{1}{2}$,



cX r (— w, i0 



Q-n(^o) 



- 1 



Var 



7 

0"r t (^) 



; - E[W- UtV (ak u , ak v )}, 

->■ Var(M/_ u>t ,(afc u , a/c„)). 



cVn(*«) - ^n(^o)] 



(20) 
(21) 



Similarly, if z~ & = (s — S u t n ,t + (5 U + 5 v )t n ), where 5 U = ak u A c , 5 V = ak v A c with k u , k v > 0, 
then 



Cov 



C J2i£j n (z 5 )\J n (z ) Z i C ^ieJ n (z- s )\Mz ) Z i\ c2 l a l( Z min(6~8)) ~ a l( Z o)} 



(22) 



a n (z s )a n (zg) 
-)> Cov (W- U}V (ak u , ak v ), W- UjV (ak u , ak v )). 



Since Ylii^j n (zA\j n (zo) Zi * s independent of X c ( — u, v) and is asymptotically normal by Assump- 
tions E?Hb)-(c) and IB. 2Tb), (b ) follo ws from (I20l)-( l22i) . Lastly, (c) is a direct consequence 



of Lemma 2.3 of 



Chan and Lai 



(120061). 



□ 



B.3 Proof of Lemma B.4 



To deal with technicalities associated with non-rectangular edges, we extend the domain of $\mathbb{H}_n$
to C x [t n , l) d for some C = [-C, C] d by embedding (x\ } Yx), (x2, Y 2 ), ... as a subsequence of 
(xx, Yi), (x2, Y 2 ), . . . with Xi G [— (C + 1), (C + l)] d . Hence the domain of X c can be extended 
to {(— u, v) G t^C 2 : 1 < v — u < t^ 1 !} with (C) and (Al)-(A5) satisfied uniformly over 
{(— u, v) G t~ l C 2 : 1 < v - u < LI} for any fixed L > 1. 



Proof of Lemma\K^c). Let Q n = {w G (| \ogt n \Z) d : I(w, \ \ogt n \) fl (t^X) ^ 0}. Since X 
is Jordan measurable, #Q n ~ #Q n - Hence by ffTTj) and (EG 



P3 < sup V" g w>P = o sup V 9»,p) =o(l). 
PeP - v PeP y 



n 



weQn 



□ 



Proof of Lemma B.4(b). For $n$ large enough such that $|\log t_n| > L$, $u \in I(w_1, |\log t_n|)$,
$v \in I(w_2, |\log t_n|)$; $v - u \le L\mathbf{1}$ can occur only when $w_1, w_2$ are neighboring cubes. Note that
each cube has not more than $3^d - 1$ neighbors. When $w_1$ and $w_2$ are neighboring cubes, define

$$D_{w_1,w_2} = \{(-u,v) : u \in I(w_1, |\log t_n|),\ v \in I(w_2, |\log t_n|),\ \mathbf{1} \le v-u \le L\mathbf{1}\}.$$

Since $\mathrm{vol}(D_{w_1,w_2}) = o(|\log t_n|^d)$ and $H(-u,v) \le 1$, by Corollary 2.7 of Chan and Lai (2006),

$$P\Big\{\sup_{(-u,v)\in D_{w_1,w_2}} X_c(-u,v) > c\Big\} \sim \psi(c)\Delta_c^{-2d}\int_{D_{w_1,w_2}} H(-u,v)\,d(-u,v) = o(\psi(c)\Delta_c^{-2d}|\log t_n|^d)$$

uniformly over $P \in \mathcal{P}$ and over neighboring $w_1$ and $w_2$. Hence by (7) and (13), $p_2 = o((\#Q_n)\psi(c)\Delta_c^{-2d}|\log t_n|^d) = o(1)$. □

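Proof of Lemma B.4(a). For each $\ell \in \mathbb{Z}^d$, $\ell \ne 0$ with $0 \le \ell \le [\log_L(2t_n^{-1}C)]\mathbf{1}$, define

$$X_{c,\ell}(-u,v) = M_n(ut_nL^{\ell}, (v-u)t_nL^{\ell}) \quad \text{for } u, v \in (t_nL^{\ell})^{-1}\mathcal{C} \text{ with } (v-u) \ge \mathbf{1}.$$

We use here the convention $a\mathcal{C} = \prod_{j=1}^{d}[-a_jC, a_jC]$. To avoid double counting, we restrict
the domain of $X_{c,\ell}$ to

$$D_\ell = \{(-u,v) \in (t_nL^{\ell})^{-1}\mathcal{C}^2 : \mathbf{1} \le v-u \le L\mathbf{1}\}.$$

By Corollary 2.7 of Chan and Lai (2006),

$$P\Big\{\sup_{(-u,v)\in D_\ell} X_{c,\ell}(-u,v) > c\Big\} \sim \psi(c)\Delta_c^{-2d}\int_{D_\ell} H(-u,v)\,d(-u,v) \quad \text{uniformly over } \ell \text{ and } P \in \mathcal{P}, \tag{23}$$

with $H(-u,v) = O(1)$ uniformly over $\ell$ and $D_\ell$. By (C), $\psi(c)\Delta_c^{-2d}\mathrm{vol}(t_n^{-1}\mathcal{C}) = O(1)$ and so

$$\psi(c)\Delta_c^{-2d}\int_{D_\ell} H(-u,v)\,d(-u,v) = O(|L^{\ell}|^{-1}) \quad \text{uniformly over } \ell. \tag{24}$$

Hence by (23) and (24),

$$p_1 \le \sup_{P\in\mathcal{P}} \sum_{0\le\ell\le[\log_L(2t_n^{-1}C)]\mathbf{1},\ \ell\ne 0} P\Big\{\sup_{(-u,v)\in D_\ell} X_{c,\ell}(-u,v) > c\Big\} = O\Big(\sum_{0\le\ell\le[\log_L(2t_n^{-1}C)]\mathbf{1},\ \ell\ne 0} |L^{\ell}|^{-1}\Big).$$

The sum above within $O(\cdot)$ is bounded by $(\sum_{k=0}^{\infty} L^{-k})^d - 1 = (1 - L^{-1})^{-d} - 1$, which can be
made arbitrarily small by choosing $L$ large enough. □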
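
The closing bound in this proof can be checked numerically. The sketch below is an illustration only, not part of the argument: the function names are hypothetical, $|L^{\ell}|$ is assumed to mean $L^{\ell_1+\cdots+\ell_d}$, and the lattice sum is truncated at an arbitrary index `kmax` rather than at $[\log_L(2t_n^{-1}C)]$, so it slightly overstates the finite sum used in the proof.

```python
from itertools import product


def lattice_sum(L: float, d: int, kmax: int = 60) -> float:
    """Truncated sum of L**(-(l_1+...+l_d)) over integer vectors l >= 0 with l != 0."""
    total = 0.0
    for ell in product(range(kmax + 1), repeat=d):
        if any(ell):  # skip the zero vector, as in the proof
            total += L ** (-sum(ell))
    return total


def closed_form(L: float, d: int) -> float:
    """(sum_{k>=0} L**-k)**d - 1 = (1 - 1/L)**(-d) - 1."""
    return (1.0 - 1.0 / L) ** (-d) - 1.0


if __name__ == "__main__":
    for L in (2.0, 10.0, 100.0):
        # The two columns agree, and both shrink as L grows.
        print(L, lattice_sum(L, d=2), closed_form(L, d=2))
```

The closed form decreases to zero as $L$ grows, which is the sense in which the bound on $p_1$ can be made arbitrarily small.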
B.4 Extension to Multivariate Y

Let E wJ = {sup^^gx,. X C)i > c}, where c = ft " +nuni yy f or g i ven e dy , see 

(10). To extend the proof of Theorem B.1 to $d_Y > 1$, it suffices to prove the following:

Lemma B.7. $p_4 = \sup_{P\in\mathcal{P}} \sum_{w\in Q_n} \sum_{j_1\ne j_2} P(E_{w,j_1} \cap E_{w,j_2}) \to 0$.

Proof. Fix $w$ and partition $D_w$ into cubes of length $\Delta_c$. More specifically, define $K_c =
\{z \in (\Delta_c\mathbb{Z})^{2d} : I(z, \Delta_c\mathbf{1}) \cap D_w \ne \emptyset\}$. Let $G_{z,j} = \{\sup_{(-u,v)\in I(z,\Delta_c\mathbf{1})} X_{c,j}(-u,v) > c\}$. Then,
uniformly over $z$, $P \in \mathcal{P}$ and $1 \le j \le d_Y$,

$$P(G_{z,j} \cap \{X_{c,j}(z) \le c - \theta/c\}) \sim \psi(c)H_\theta(z), \tag{25}$$

$$\text{where } H_\theta(z) = \int_\theta^\infty e^{y} P\Big\{\sup_{0\le w\le\mathbf{1}} W_z(w) > y\Big\}\,dy.$$

This extends Theorem 2.4 of Chan and Lai (2006) to $\theta \ne 0$, using the same proof. Since
$H_0(z) < \infty$, for any given $\epsilon > 0$, we can select $\theta$ large enough such that $H_\theta(z) \le \epsilon$. In
addition, by (18), this selection can be made to be uniform over $z \in K_c$ and $1 \le j \le d_Y$.
Note that

$$P(E_{w,j_1} \cap E_{w,j_2}) \le P(X_{c,j_1}(z_1) > c-\theta/c,\ X_{c,j_2}(z_2) > c-\theta/c \text{ for some } z_1, z_2 \in K_c) + \eta_{c,w},$$

$$\text{where } \eta_{c,w} = \sum_{z\in K_c}\big[P(G_{z,j_1} \cap \{X_{c,j_1}(z) \le c-\theta/c\}) + P(G_{z,j_2} \cap \{X_{c,j_2}(z) \le c-\theta/c\})\big],$$

and with $\theta$ selected so that $H_\theta(z) \le \epsilon$, it follows from (25) that

$$\eta_{c,w} = \epsilon\,O(\psi(c)(\#K_c)) = \epsilon\,O(\psi(c)\Delta_c^{-2d}|\log t_n|^d).$$

By (C), $\psi(c) = O(t_n^d\Delta_c^{2d})$ and hence by (13), $\sum_{w\in Q_n}\eta_{c,w} = \epsilon\,O(1)$. It remains for us to show
that for all $\theta > 0$,

$$\sum_{z_1,z_2\in K_c} P(X_{c,j_1}(z_1) > c-\theta/c,\ X_{c,j_2}(z_2) > c-\theta/c) = o(\psi(c)\Delta_c^{-2d}|\log t_n|^d). \tag{26}$$

Now by Assumption B.1(d), $S(z_1,z_2) = X_{c,j_1}(z_1) + X_{c,j_2}(z_2)$ has mean 0 and variance lying
between $2(1-\rho)$ and $2(1+\rho)$. Let $\kappa = (\frac{2}{1+\rho})^{1/2}$ $(> 1)$. By Lemma B.5(a),

$$P\{S(z_1,z_2) > 2(c-\theta/c)\} \le [1+o(1)][1-\Phi(\kappa(c-\theta/c))] \sim \frac{1}{\sqrt{2\pi}\,\kappa c}\,e^{-\kappa^2c^2/2+\kappa^2\theta} \tag{27}$$

uniformly over $P \in \mathcal{P}$. Since $\#K_c = O(\Delta_c^{-2d}|\log t_n|^d)$, it follows from (27) that

$$\sum_{z_1,z_2\in K_c} P(X_{c,j_1}(z_1) > c-\theta/c,\ X_{c,j_2}(z_2) > c-\theta/c) \le \sum_{z_1,z_2\in K_c} P\{S(z_1,z_2) > 2(c-\theta/c)\} = O(\psi(c)e^{-(\kappa^2-1)c^2/2}\Delta_c^{-4d}|\log t_n|^{2d})$$

uniformly over $P \in \mathcal{P}$, and (26) holds because $|\log t_n|^d = O(c^{2d})$ and $c^{6d}e^{-(\kappa^2-1)c^2/2} = o(1)$. □
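
The reason the pairwise terms are negligible is the exponential gap between the tail at $\kappa c$ and the tail at $c$ when $\kappa > 1$. The following sketch illustrates this with exact normal tails. It assumes $\kappa = (2/(1+\rho))^{1/2}$, as the variance bound $2(1+\rho)$ suggests, uses the normal survival function as a stand-in for $\psi(c)$, and ignores the polynomial factors in $c$ that appear in the proof; all function names are hypothetical.

```python
import math


def normal_sf(x: float) -> float:
    """Standard normal survival function 1 - Phi(x), via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def mills_approx(x: float) -> float:
    """Leading-order tail approximation phi(x) / x, valid for large x."""
    return math.exp(-0.5 * x * x) / (x * math.sqrt(2.0 * math.pi))


def kappa(rho: float) -> float:
    """Assumed scaling factor (2 / (1 + rho))**0.5, which exceeds 1 whenever rho < 1."""
    return math.sqrt(2.0 / (1.0 + rho))


def joint_to_single_ratio(c: float, rho: float, theta: float = 1.0) -> float:
    """Ratio of the tail at kappa*(c - theta/c) to the tail at c."""
    return normal_sf(kappa(rho) * (c - theta / c)) / normal_sf(c)


if __name__ == "__main__":
    for c in (3.0, 5.0, 8.0):
        # The ratio shrinks quickly as c grows.
        print(c, joint_to_single_ratio(c, rho=0.5))
```

Because the ratio decays like $e^{-(\kappa^2-1)c^2/2}$, any fixed power of $c$ multiplying it still tends to zero.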

B.5 Verification of (A3)-(A5)

Conditions (C), (A1) and (A2) have been verified in Section B.2. The remaining regularity
conditions that lead to (11) will be verified in Lemmas B.8 and B.9 below.

Lemma B.8. (a) [(A3) of Chan and Lai (2006)] Let $\gamma > 0$ and $k_u, k_v > 0$. There exists a
positive function $h$ such that $\lim_{y\to\infty} h(y) = 0$ and

$$P\{X_c(-u+k_u\Delta_c, v+k_v\Delta_c) > c-\gamma/c,\ X_c(-u,v) \le c-y/c\} \le h(y)\psi(c) \quad \text{for all large } c,$$

uniformly over $(-u,v) \in D_L$ and $P \in \mathcal{P}$.

(b) [(A5) of Chan and Lai (2006)] There exists a nonincreasing positive function $r$ on
$[0,\infty)$ such that $r(\|k\|) = O(e^{-\|k\|^p})$ for some $p > 0$ such that for any $\gamma > 0$,

$$P\{X_c(-u,v) > c-\gamma/c,\ X_c(-u+k_u\Delta_c, v+k_v\Delta_c) > c-\gamma/c\} \le \psi(c-\gamma/c)\,r(\|k_u,k_v\|) \quad \text{for all large } c,$$

uniformly over $P \in \mathcal{P}$, $(-u,v), (-u+k_u\Delta_c, v+k_v\Delta_c) \in D_w$ and $w \in Q_n$.

Proof. Let $\omega > 1$ to be specified later. By Lemma B.5(a), there exists $\xi_c \to 0$ such that

$$P\{X_c(-u,v) > c-y'/c\} = [1+O(\xi_c^2)]\,e^{y'}\psi(c)$$

uniformly over $\gamma \le y' \le \omega c$ and $P \in \mathcal{P}$. Let $y_j = y + j\xi_c$, $j = 0, 1, \ldots$. Let $u_1 = u - k_u\Delta_c$
and $v_1 = v + k_v\Delta_c$. Since $e^{\xi_c} = 1 + \xi_c + O(\xi_c^2)$,

$$P\{X_c(-u,v) > c-y_{j+1}/c\} - P\{X_c(-u,v) > c-y_j/c\} = [1+O(\xi_c^2)]e^{y_{j+1}}\psi(c) - [1+O(\xi_c^2)]e^{y_j}\psi(c) \sim \xi_c e^{y_j}\psi(c) \tag{28}$$

uniformly over $\gamma \le y_j \le \omega c$ and $P \in \mathcal{P}$. Since $P\{X_c(-u_1,v_1) > a \mid X_c(-u,v) = b\}$ increases
with $b$ for any fixed $a$, it follows from (28) that

$$P\{X_c(-u_1,v_1) > c-\gamma/c,\ c-\omega < X_c(-u,v) \le c-y/c\} \tag{29}$$

$$\le \sum_{0\le j<(\omega c-y)/\xi_c} P\{X_c(-u_1,v_1) > c-\gamma/c \mid X_c(-u,v) = c-y_j/c\} \times \big[P\{X_c(-u,v) > c-y_{j+1}/c\} - P\{X_c(-u,v) > c-y_j/c\}\big]$$

$$\sim \psi(c)\,\xi_c \sum_{y\le y_j\le\omega c} e^{y_j} P\{X_c(-u_1,v_1) > c-\gamma/c \mid X_c(-u,v) = c-y_j/c\}.$$

Let ^r<5 = (vi — Ui)t n ), k = (k u , k v ) and let 

g-uM = E k A u ' ] + K '\ {= E[w. u . v (K, K)] = Vav(w^ v (k u , k v ))/2}, (30) 

(see (USD). Then by (fl8])-(j2l]) with a = 1, 

P{c[X c (-mi, u x ) - X c (-u,v)] > yj - j\X c (-u,v) = c - yj/c} (31) 

p r E ieJ „ (2f )v w(2 o) g > y_i - 7 - c(c - y 3 /c) [^{g - iK(g) } 

V^^) -^n(^o) ~ Cv/flT^^) -(t2(z ) J 

p f E»gj„(^)\j n ( Z0 )^ > yj -7 + ff- u ,„(fc) + o(jj l ,f Vj-l + 9-u,v{ k ) 
V^n(^) - ^(^0) " v /2 ^( fc ) + °( 1 ) '~ ^ 

with o(l) uniform over y < yj < 00c and (—it, u) G P^ and P G P, noting that as y' = 0(c) 
the relative error of the normal tail approximation in (l3~Tj) is 

c 3 \ „ / c 4 



(see Assumption IB. 2( a) and Lemma IB.5f a)). By f[2"9"j) and f[3"TT) . 

P{X C (— it!, ui) > c — 7/c, c — oj < X c (— -u, f ) < c — y/c} (32) 

1 h V v/2 9 _„,„(fc) ' 
To complete the proof of (a), it suffices to show that

$$(\mathrm{II}) = P\{X_c(-u_1,v_1) > c-\gamma/c,\ X_c(-u,v) \le c-\omega\} = o(\psi(c)) \tag{33}$$

for all $\omega$ large. By (jSJ) and



= a , n (zi)X c (-«i,Vi) - <7 n (2 )X c (-u,i>). (34) 

Since cr n (^) > cr n (2 ), by (12T|) . (I30I) and Lemma IB. 5( a) with a — 1, 

EieJ„(^)\J„( Z0 ) Zi ^ a n (2 )(wc-7) i L „/ 



(II) < P > j, = l + 0(^)^) 

'Oil,/ 



where x r . 



a 2 n {z 5 ) - ol(z Q ) al{z 5 ) - al(z ) 

ujc — 7 cue — 7 



Cx /4A c [g_ U) „(fc) + o(l)] ^/2[0_ Ui „(fc) + o(l)]' 



and indeed (33) holds when $\omega > \sqrt{2g_{-u,v}(k)}$.

To prove (b), we apply Lemma B.5(a) to the right-hand side of the inequality

$$P\{X_c(-u,v) > c-\gamma/c,\ X_c(-u_1,v_1) > c-\gamma/c\} \le P\{X_c(-u,v)+X_c(-u_1,v_1) > 2(c-\gamma/c)\}\ [= (\mathrm{III})].$$

As in the proof of Lemma B.4(b), the relative error of the normal approximation goes to 0
due to Assumption B.2(a), that is,

$$(\mathrm{III}) \sim \psi\left(\frac{2(c-\gamma/c)}{\sqrt{2+2\rho_c(-u,v,-u_1,v_1)}}\right) \quad \text{as } c \to \infty. \tag{35}$$

Note that in the statement of (b), the restriction $k_u, k_v > 0$ is removed and we have in place
of (17),

$$\rho_c(-u,v,-u_1,v_1) = \frac{\sigma_n^2(z^*)}{\sigma_n(z_0)\sigma_n(z_\delta)},$$

where $z^* = (-(u\vee u_1), (v\wedge v_1))$. Since $J_n(z^*) = J_n(z_0)\cap J_n(z_\delta)$, by expanding $\sigma_n(z^*)/\sigma_n(z_0)$
and $\sigma_n(z^*)/\sigma_n(z_\delta)$ as in (17), it follows from (18) and (30) with $\delta_u = k_u\Delta_c$, $\delta_v = k_v\Delta_c$ that



p c (-u,v,-u h vi) = l-(l + o(l)) 



2al(z*) 2a* n (z*) 

(M + + (Sv,j) + , ^ (^)- + (M~ 



1 - (1 + (1)){ ^ 1 "' 3 " r + } 



, i 'V - "./ u i - '0 



$$= 1 - (4+o(1))\Delta_c\,g_{-u,v}(|k|),$$

from which it follows that

$$\frac{2(c-\gamma/c)}{\sqrt{2+2\rho_c(-u,v,-u_1,v_1)}} = (c-\gamma/c)\big[1-(2+o(1))\Delta_c\,g_{-u,v}(|k|)\big]^{-1/2} \ge c + \big[g_{-u,v}(|k|)/2 - \gamma + o(1)\big]/c,$$

and (b) with $r(\tau) = \exp[-\min_{\|k\|=\tau} g_{-u,v}(|k|)/4]$ and $p \le 1$ follows from (35). □
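
The last step of the proof of (b) rests on a first-order expansion of the threshold $(c-\gamma/c)[1-2\Delta_c g]^{-1/2}$. The sketch below checks that expansion numerically. It is an illustration under two labeled assumptions: $c^2\Delta_c = 1/2$, as used earlier in the appendix, and the $o(1)$ terms are dropped. The function names and the values of `gamma` and `g` are hypothetical.

```python
import math


def exact_threshold(c: float, gamma: float, g: float) -> float:
    """(c - gamma/c) * (1 - 2*Delta_c*g)**(-1/2), with Delta_c = 1/(2 c^2) assumed."""
    delta_c = 1.0 / (2.0 * c * c)
    return (c - gamma / c) / math.sqrt(1.0 - 2.0 * delta_c * g)


def expanded_threshold(c: float, gamma: float, g: float) -> float:
    """First-order expansion c + (g/2 - gamma)/c."""
    return c + (g / 2.0 - gamma) / c


if __name__ == "__main__":
    for c in (5.0, 20.0, 50.0):
        gap = exact_threshold(c, 1.0, 2.0) - expanded_threshold(c, 1.0, 2.0)
        # c * gap is the error inside the [.]/c bracket; it tends to zero.
        print(c, gap, c * gap)
```

The scaled gap `c * gap` vanishes as $c$ grows, which is what the $o(1)$ inside the bracket records.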



Lemma B.9. (a) [Theorem 1 of Wichura (1969)] Let $A$ be a finite subset of $\mathbb{N}^d$ and let $U_i$,
$i \in A$ be independent mean 0 random variables with variance $\sigma_i^2$. Let $S_k = \sum_{i\le k} U_i$ and set
$s_A^2 = \sum_{i\in A}\sigma_i^2$, $S_A = \sum_{i\in A} U_i$. Then for any $x > 2^d s_A$,

$$P(\max_k |S_k| > x) \le [1-(2^d s_A/x)^2]^{-d}\,P(|S_A| > 2^{-d}x). \tag{36}$$

(b) [(A4) of Chan and Lai (2006)] There exist nonincreasing functions $N_a$ on $\mathbb{R}^+$ and
positive constants $\gamma_a \to 0$ such that $N_a(\gamma_a) + \int_1^\infty r^s N_a(\gamma_a+r)\,dr = o(a^d)$ as $a \to 0$ for all
$s > 0$, and for each $a > 0$,

$$P\Big\{\sup_{0\le(k_u,k_v)\le a\mathbf{1}} X_c(-u+k_u\Delta_c, v+k_v\Delta_c) > c,\ X_c(-u,v) \le c-\gamma/c\Big\} \le N_a(\gamma)\psi(c), \tag{37}$$

uniformly over $(-u,v) \in D_L$ and $P \in \mathcal{P}$ for all $\gamma_a \le \gamma \le c$ with $c$ large.
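
As a sanity check on the maximal inequality in part (a), the following Monte Carlo sketch compares both sides in the simplest case $d = 1$. It is an illustration only: it assumes i.i.d. standard normal increments (the lemma allows general independent mean-zero variables), reads the bound as $[1-(2^d s_A/x)^2]^{-d}P(|S_A| > 2^{-d}x)$, and uses hypothetical names and an arbitrary seed.

```python
import math
import random


def wichura_sides(n_steps: int = 50, n_rep: int = 5000,
                  x_mult: float = 3.0, seed: int = 0):
    """Estimate P(max_k |S_k| > x) and the upper bound for d = 1, with x = x_mult * s_A."""
    d = 1
    rng = random.Random(seed)
    s_a = math.sqrt(n_steps)          # unit-variance steps, so s_A^2 = n_steps
    x = x_mult * s_a                  # requires x_mult > 2**d
    hits_max = 0
    hits_end = 0
    for _ in range(n_rep):
        partial, running_max = 0.0, 0.0
        for _ in range(n_steps):
            partial += rng.gauss(0.0, 1.0)
            running_max = max(running_max, abs(partial))
        hits_max += running_max > x
        hits_end += abs(partial) > x / 2 ** d
    lhs = hits_max / n_rep
    rhs = (1.0 - (2 ** d * s_a / x) ** 2) ** (-d) * hits_end / n_rep
    return lhs, rhs


if __name__ == "__main__":
    print(wichura_sides())  # the first entry should not exceed the second
```

The bound is far from tight here; its value in the proof is that it reduces a supremum over partial sums to a single tail probability.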



Proof. Though Wichura (1969) considers a set $A$ with points lying on a $d$-dimensional grid,
we can always extend $A$ to a $d$-dimensional grid $B$ by letting $U_i = 0$ for $i \in B \setminus A$. Note
that the right-hand side of (36) is unchanged by such an extension. Let $u_1 = u - k_u\Delta_c$,
$v_1 = v + k_v\Delta_c$, $k = (k_u,k_v)$ and $z_\delta = (u_1t_n, (v_1-u_1)t_n)$. Let $\omega > 1$ to be specified later.
Since $\sigma_n(z_\delta) \ge \sigma_n(z_0)$ when $J_n(z_\delta) \supset J_n(z_0)$, by the arguments in (29) and (31),

$$(\mathrm{I}) = P\Big\{\sup_{0\le k\le a\mathbf{1}} X_c(-u_1,v_1) > c,\ c-\omega < X_c(-u,v) \le c-\gamma/c\Big\} \tag{38}$$

$$\sim \psi(c)\int_\gamma^{\omega c} e^{y} P\Big\{\sup_{0\le k\le a\mathbf{1}} \sum_{i\in J_n(z_\delta)\setminus J_n(z_0)} Z_i \ge \sigma_n(z_0)y/c\Big\}\,dy.$$

Let $\mathcal{B}$ be the set of all $d$-dimensional vectors with coordinates taking values $-1$, $0$ or $1$
but not all zeros. Hence $\#\mathcal{B} = 3^d - 1$. Consider the partitioning of $A = J_n((u-a\Delta_c)t_n, (v-u+2a\Delta_c)t_n) \setminus J_n(z_0)$ as $A = \bigcup_{b\in\mathcal{B}} A_b$, with

$$A_b = \{i : (u_j-a\Delta_c)t_n < X_{ij} \le u_jt_n \text{ if } b_j = -1,\quad u_jt_n < X_{ij} \le v_jt_n \text{ if } b_j = 0,\quad v_jt_n < X_{ij} \le (v_j+a\Delta_c)t_n \text{ if } b_j = 1\}.$$
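
The index set used for this partition is small enough to enumerate directly. The sketch below, with hypothetical function names, lists the sign vectors with entries in $\{-1, 0, 1\}$ that are not identically zero, and separates out those with a single non-zero entry, which correspond to the "face" pieces of the shell around the original box.

```python
from itertools import product


def sign_vectors(d: int):
    """All vectors in {-1, 0, 1}^d except the zero vector; there are 3**d - 1 of them."""
    return [b for b in product((-1, 0, 1), repeat=d) if any(b)]


def single_nonzero(d: int):
    """Sign vectors with exactly one non-zero coordinate; there are 2*d of them."""
    return [b for b in sign_vectors(d) if sum(1 for x in b if x != 0) == 1]


if __name__ == "__main__":
    for d in (1, 2, 3):
        print(d, len(sign_vectors(d)), len(single_nonzero(d)))
```

The same count, $3^d - 1$, is the maximum number of neighbors of a cube in a $d$-dimensional grid.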

Then 

sup Zi < } max S k b , where S kib = } Z { 



°- k - al ie.J„(z s )\J n (zo) beB ieAb,i<k 

Let x = By JTB} , and since u,- - Uj > 1, 

x yvol(f — -u) y r-rj / 2aA c \ n — '- -'- 



sa " 3 d c(vo\(v -u + 2aA c l) - vol(« - u)) 1 / 2 3 d c ( X + ^ _ Mj ) X J ( 39 ) 
Hence for all large c, 

^>2^/ 2 w hen,> 7 fora<(^) 2 i 
By fl36|) and fl38l) . and since < s^, 

(i) < (2 d +o(m(c)Yl r* p {\ s *\ - ^#^H < 4 °) 

tee ^ 

Apply Lemma B.5(a) and note that the sum in (40) is dominated by the $2d$ values of $b$ having
a single non-zero entry. Then by (39), (37) holds (for large $c$ and small $a$) when



We check that when 7 a = a 1 / 3 , then iV a (7 a ) + J^ 00 T s N a {^ a + r) = o(a p ) as a -» for all s > 
and p > 0, and that 

P{ SUp X c (— -U 1; Ux) > C, X c (— M, f ) < C— w} < P< SUp > U(7 n (z ) > = o(/0(c)) 

0<fc<al 0<fc<al . T ,^f 7 / \ ' 

for all $\omega$ large, by a similar partitioning argument and applications of Lemmas B.5(a) and
B.9(a). □

C Proof of Theorem



an 



We have 

rhj(9 n ,x) =m j (6 l o, x Q ) + rhj(6 n ,x) -fhj(9 ,x) + fhj(9 ,x) -fhj(9 ,x ) 

— — I ||x - x || 7 

\\X Xq II J 

for $x$ in some neighborhood of $x_0$. Thus, letting $h$ be some small scalar going to zero with
$n$, for $sh$ and $(s+t)h$ small enough, we have

Errij(Wi, 9 n )I(sh + x < X { < (s + t)h + x ) 

< / < [rhej{9* n ,x)a]r n + C ( ,, _ ° j ||x - x || 7 > f(x) dx 

Jx£X,sh+x <x<(s+t)h+x I \\\ x x 0\\ / J 

rh e ,j(9* n , x + uh)a]r n /h y + C (t^j) IM| 7 ) /(x + uh)h dx+ ~< du 

I x +uheX,s<u<s+t (. \\\ U \ 

where the last equality uses the change of variables $u = (x - x_0)/h$. We also have

Cr|(s/i + Xq, (s + t)h + Xq, 9 n ) 

= Emj(Wi, 9 n ) 2 I(sh + x < X { < (s + t)h + x ) - [Emj(Wi, 6 n )I( sh + x < X { < (s + t)h + x )f 



< / fxlj(x)f(x)dx= / fj%j(x + uh)f(xo + uh)h dx dx. 

J x£X ,sh+XQ<x<(s+t)h+XQ J xo-\-uh£X ,s<u<s+t 

using the same change of variables. Under these assumptions, $\Sigma_{jj}(x_0+uh)$ converges to
$\Sigma_{jj}(x_0)$ and $f(x_0+uh) \to f(x_0)$ uniformly over bounded $u$ as $h$ approaches 0.

Thus, for any $\epsilon > 0$, we will have, for small enough $h$ and bounded $s$ and $t$, $Em_j(W_i,\theta_n)I(sh+x_0 < X_i < (s+t)h+x_0)/\sigma_j(sh+x_0, th, \theta_n)$ is, for any $s, t$ such that the expression is negative,
bounded from above by



f l d x /2+ 1 



lf(*o) 1/2 - e] I xo+uh ^ s<u<s+t {[moA0o,x o )a + e]r n /W + C (^) \\uV) du 



[Ejf(x) + e\vo\{u\x Q + uheX,s<u<s + t} 1 / 2 



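Setting $h = r_n^{1/\gamma}$, this is equal to

$$r_n^{(d_X/2+\gamma)/\gamma}\,\lambda(s,t,(\mathcal{X}-x_0)/r_n^{1/\gamma},\epsilon)$$

for a function $\lambda$ that does not depend on $r_n$. Note that the sequence of sets $(\mathcal{X}-x_0)/r_n^{1/\gamma}$
satisfies $(\mathcal{X}-x_0)/r_k^{1/\gamma} \subseteq (\mathcal{X}-x_0)/r_\ell^{1/\gamma}$ for $r_\ell < r_k$ by convexity of $\mathcal{X}$, so, letting $\mathcal{U} = \bigcup_{k=1}^{\infty}(\mathcal{X}-x_0)/r_k^{1/\gamma}$, we will have $\mathrm{vol}(\{s < u < s+t\} \cap (\mathcal{U}\setminus(\mathcal{X}-x_0)/r_k^{1/\gamma})) \to 0$. It follows that $\lambda(s,t,(\mathcal{X}-x_0)/r_n^{1/\gamma},\epsilon) \to \lambda(s,t,\mathcal{U},\epsilon)$. Since this holds for all $\epsilon$ and $\lambda$ is continuous in $\epsilon$, we have, for any
$s,t$,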


EnmjjWi, 6 n )I{srn h + x < X { < (s + t)r„ /7 + x ) 
crjisri/' 7 + Xo^trl/ 1 ,8 n ) 



Ql-a 



EmjiWi, e n )I{sr yi + x <X i <(s + t)rl h + x ) _ 1/2 A 

~~ T~th 7Th~T\ h Up ^ n > ~ qi - a 

o-j{sr n +x , 

> rl dx / 2+ ^{-\(s,t,U) + o(l) - qi~ a /rl dx/2+ ^) + Op^ 2 ) 

= l-X(sXU) + - ^g^ (l + o P (l)) + P (n^ 2 r-J^ 2 ^) 

Note that the op(l) term absorbs the Op(l) term, so that the above expression will be 
negative with probability approaching one for some s, t as long as 

(21ogt~ dx ) 1/2 
limsup fe/ 2+7) / 7 < sup -\(s,t,U), 

and this condition can be rearranged to 

j/(d x +2-y) 

>-l/l 

■ s.t 



liminfr n ( j > -l/[inf A(s, t, W)] 7/(dx/2+7) 

2 login x 



D Comparison to Intermediate Gaussian Approximations

In this section of the appendix, we compare our approach to the results that could be 
obtained using intermediate gaussian approximations. As shown in Section HJ t n must be 
chosen at least as small as the optimal bandwidth in order for the test to have good power 
for a given data generating process. Theorem 2.1 allows $t_n$ to be chosen equal to $n^{-1/d_X}$
times a $\log n$ term, which is small enough to adapt to any Hölder class for the conditional
mean. Using the best available results for gaussian approximations in Rio (1994) would give
a rate of approximation of a $\log n$ term times $n^{-1/(2(d_X+1))}$ for the random process $(s,t) \mapsto
\sqrt{n}E_nm(W_i,\theta)I(s < X_i < s+t)$. The test statistic weights this by the inverse of its estimated
standard deviation which, at the minimum scale $t_n$, is of order $t_n^{d_X/2}$. Thus, in order to
use the gaussian approximation of Rio (1994), we would need $t_n^{-d_X/2}\cdot n^{-1/(2(d_X+1))}$ to go to
zero more quickly than a $\log n$ term, which would mean that $t_n$ would have to decrease more
slowly than a $\log n$ term times $n^{-1/(d_X(d_X+1))}$. For the test to achieve optimal power when the
conditional mean has $\gamma$ continuous derivatives (where noninteger values of $\gamma$ correspond
to Hölder conditions), $t_n$ must decrease at least as quickly as $n^{-1/(d_X+2\gamma)}$. Thus, using a
gaussian approximation would only lead to optimal power when $\frac{1}{d_X+2\gamma} \le \frac{1}{d_X(d_X+1)}$, which
can be rewritten as $d_X + 2\gamma \ge d_X^2 + d_X$ or $\gamma \ge d_X^2/2$. Another approach is to restrict
the set $(s,t)$ over which the supremum is taken to a finite set and place conditions on the
rate at which this set increases with the sample size. While this approach does not apply
directly to the statistic considered here, it is useful to compare our results to this approach
as well. Using the results of Chatterjee along with this approach and a method of
proof that avoids deriving an asymptotic distribution, Chetverikov (2012) provides a test
that is adaptive in the range $\gamma \in [d_X, 2]$.
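
The rate comparison above reduces to simple arithmetic in $d_X$ and $\gamma$. The sketch below, with hypothetical function names, evaluates the condition $1/(d_X+2\gamma) \le 1/(d_X(d_X+1))$ exactly and reports the resulting range of smoothness levels. The cap at $\gamma = 2$ is taken from the limit on adaptivity for positive kernels discussed in this appendix and is passed in as an assumption.

```python
from fractions import Fraction


def rate_condition_holds(d_x: int, gamma) -> bool:
    """True when 1/(d_x + 2*gamma) <= 1/(d_x*(d_x + 1))."""
    return Fraction(1) / (d_x + 2 * Fraction(gamma)) <= Fraction(1, d_x * (d_x + 1))


def adaptive_range(d_x: int, gamma_max=Fraction(2)):
    """Interval [d_x^2/2, gamma_max] of smoothness levels, or None if it is empty."""
    lower = Fraction(d_x * d_x, 2)   # from d_x + 2*gamma >= d_x**2 + d_x
    if lower > gamma_max:
        return None
    return (lower, gamma_max)


if __name__ == "__main__":
    for d_x in (1, 2, 3):
        print(d_x, adaptive_range(d_x))
```

Under these assumptions the range is a genuine interval only for $d_X = 1$, collapses to the single point $\gamma = 2$ for $d_X = 2$, and is empty for $d_X \ge 3$.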

Since the use of positive kernels (in this case indicator functions) prevents multiscale
statistics from being adaptive to $\gamma > 2$ derivatives, this means that the approach based on
the gaussian approximations in Rio (1994) would be adaptive to a range of $[d_X^2/2, 2]$ for the
smoothness parameter $\gamma$. Thus, while this approach would lead to useful (if not optimal)
results for a one dimensional covariate, it would not be adaptive in two dimensions, and would
be dominated by a kernel statistic with a single bandwidth in more than two dimensions.
In contrast, our result allows adaptivity to all $\gamma$ in $(0, 2]$ regardless of the dimension of $X_i$,
which is the best possible result.
Figure 1: 95% Confidence Region Using Weighted Sup Statistic (this paper)

Figure 2: 95% Confidence Region Using Unweighted Statistic and Subsampling with Estimated Rate

Figure 3: 95% Confidence Region Using Unweighted Statistic and Subsampling with Conservative Rate

Table 1: False Rejection Probabilities for Least Favorable Null

Table 2: Power for Level $\alpha = .05$ Test with Critical Values Based on Finite Sample Least Favorable Distribution (Design 1)

Table 3: Power for Level $\alpha = .05$ Test with Critical Values Based on Finite Sample Least Favorable Distribution (Design 2)

Table 4: Power for Level $\alpha = .05$ Test with Critical Values Based on Finite Sample Least Favorable Distribution (Design 3)

                                                 $\theta_1$     $\theta_2$
Weighted Sup Statistic (this paper)              [-30, 109]     [0.0053, 0.0320]
Unweighted, Subsampling with Estimated Rate      [-48, 84]      [0.0113, 0.0342]
Unweighted, Subsampling with Conservative Rate   [-60, 138]     [0.0030, 0.0372]

Table 5: 95% Confidence Intervals for Components of $\theta$